SPARSITY IN MULTIPLE KERNEL LEARNING

By Vladimir Koltchinskii 1 and Ming Yuan 2 
Georgia Institute of Technology 

The problem of multiple kernel learning based on penalized em- 
pirical risk minimization is discussed. The complexity penalty is de- 
termined jointly by the empirical L2 norms and the reproducing ker- 
nel Hilbert space (RKHS) norms induced by the kernels with a data- 
driven choice of regularization parameters. The main focus is on the 
case when the total number of kernels is large, but only a relatively 
small number of them is needed to represent the target function, so 
that the problem is sparse. The goal is to establish oracle inequalities 
for the excess risk of the resulting prediction rule showing that the 
method is adaptive both to the unknown design distribution and to 
the sparsity of the problem. 



1. Introduction. Let $(X_i,Y_i)$, $i=1,\dots,n$, be independent copies of a random couple $(X,Y)$ with values in $S\times T$, where $S$ is a measurable space with $\sigma$-algebra $\mathcal{A}$ (typically, $S$ is a compact subset of a finite-dimensional Euclidean space) and $T$ is a Borel subset of $\mathbb{R}$. In what follows, $P$ will denote the distribution of $(X,Y)$ and $\Pi$ the distribution of $X$. The corresponding empirical distributions, based on $(X_1,Y_1),\dots,(X_n,Y_n)$ and on $(X_1,\dots,X_n)$, will be denoted by $P_n$ and $\Pi_n$, respectively. For a measurable function $g:S\times T\mapsto\mathbb{R}$, we denote

$$Pg:=\int g\,dP=\mathbb{E}g(X,Y)\quad\text{and}\quad P_ng:=\int g\,dP_n=n^{-1}\sum_{j=1}^ng(X_j,Y_j).$$

Similarly, we use the notations $\Pi f$ and $\Pi_nf$ for the integrals of a function $f:S\mapsto\mathbb{R}$ with respect to the measures $\Pi$ and $\Pi_n$.

The goal of prediction is to learn "a reasonably good" prediction rule $f:S\mapsto\mathbb{R}$ from the empirical data $\{(X_i,Y_i):i=1,2,\dots,n\}$. To be more specific, consider a loss function $\ell:T\times\mathbb{R}\mapsto\mathbb{R}_+$ and define the risk of a prediction rule $f$ as

$$P(\ell\circ f)=\mathbb{E}\ell(Y,f(X)),$$

where $(\ell\circ f)(x,y)=\ell(y,f(x))$. An optimal prediction rule with respect to this loss is defined as

$$f_*=\arg\min P(\ell\circ f),$$

where the minimization is taken over all measurable functions and, for simplicity, it is assumed that the minimum is attained. The excess risk of a prediction rule $f$ is defined as

$$\mathcal{E}(\ell\circ f):=P(\ell\circ f)-P(\ell\circ f_*).$$

Throughout the paper, the notation $a\asymp b$ means that there exists a numerical constant $c>0$ such that $c^{-1}\le a/b\le c$. By "numerical constants" we usually mean real numbers whose precise values are not necessarily specified, or, sometimes, constants that might depend on the characteristics of the problem that are of little interest to us (e.g., some constants that depend only on the loss function).

1.1. Learning in reproducing kernel Hilbert spaces. Let $\mathcal{H}_K$ be a reproducing kernel Hilbert space (RKHS) associated with a symmetric nonnegatively definite kernel $K:S\times S\mapsto\mathbb{R}$ such that for any $x\in S$, $K_x(\cdot):=K(\cdot,x)\in\mathcal{H}_K$ and $f(x)=\langle f,K_x\rangle_{\mathcal{H}_K}$ for all $f\in\mathcal{H}_K$ [Aronszajn (1950)]. If it is known that $f_*\in\mathcal{H}_K$ and $\|f_*\|_{\mathcal{H}_K}\le1$, then it is natural to estimate $f_*$ by a solution $\hat f$ of the following empirical risk minimization problem:

(1) $$\hat f:=\arg\min_{\|f\|_{\mathcal{H}_K}\le1}\frac1n\sum_{i=1}^n\ell(Y_i,f(X_i)).$$

The size of the excess risk $\mathcal{E}(\ell\circ\hat f)$ of such an empirical solution depends on the "smoothness" of functions in the RKHS $\mathcal{H}_K$. A natural notion of "smoothness" in this context is related to the unknown design distribution $\Pi$. Namely, let $T_K$ be the integral operator from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$. Under a standard assumption that the kernel $K$ is square integrable (in the theory of RKHS it is usually even assumed that $S$ is compact and $K$ is continuous), the operator $T_K$ is compact and its spectrum is discrete. If $\{\lambda_k\}$ is the sequence of the eigenvalues (arranged in decreasing order) of $T_K$ and $\{\phi_k\}$ is the corresponding $L_2(\Pi)$-orthonormal sequence of eigenfunctions, then it is well known that the RKHS-norms of functions from the linear span of $\{\phi_k\}$ can be written as

$$\|f\|^2_{\mathcal{H}_K}=\sum_{k\ge1}\frac{|\langle f,\phi_k\rangle_{L_2(\Pi)}|^2}{\lambda_k},$$

which means that the "smoothness" of functions in $\mathcal{H}_K$ depends on the rate of decay of eigenvalues $\lambda_k$ that, in turn, depends on the design distribution $\Pi$. It is also clear that the unit balls in the RKHS $\mathcal{H}_K$ are ellipsoids in the space $L_2(\Pi)$ with "axes" $\sqrt{\lambda_k}$.

It was shown by Mendelson (2002) that the function

$$\breve\gamma_n(\delta):=\Big(n^{-1}\sum_{k\ge1}(\lambda_k\wedge\delta^2)\Big)^{1/2},\quad\delta\in[0,1],$$

provides tight upper and lower bounds (up to constants) on localized Rademacher complexities of the unit ball in $\mathcal{H}_K$ and plays an important role in the analysis of the empirical risk minimization problem (1). It is easy to see that the function $\breve\gamma_n^2(\sqrt\delta)$ is concave, $\breve\gamma_n(0)=0$ and, as a consequence, $\breve\gamma_n(\delta)/\delta$ is a decreasing function of $\delta$ and $\breve\gamma_n(\delta)/\delta^2$ is strictly decreasing. Hence, there exists a unique positive solution of the equation $\breve\gamma_n(\delta)=\delta^2$. If $\bar\delta_n$ denotes this solution, then the results of Mendelson (2002) imply that with some constant $C>0$ and with probability at least $1-e^{-t}$,

$$\mathcal{E}(\ell\circ\hat f)\le C\Big(\bar\delta_n^2+\frac tn\Big).$$

The size of the quantity $\bar\delta_n^2$ involved in this upper bound on the excess risk depends on the rate of decay of the eigenvalues $\lambda_k$ as $k\to\infty$. In particular, if $\lambda_k\asymp k^{-2\beta}$ for some $\beta>1/2$, then it is easy to see that $\breve\gamma_n(\delta)\asymp n^{-1/2}\delta^{1-1/(2\beta)}$ and $\bar\delta_n^2\asymp n^{-2\beta/(2\beta+1)}$. Recall that unit balls in $\mathcal{H}_K$ are ellipsoids in $L_2(\Pi)$ with "axes" of the order $k^{-\beta}$ and it is well known that, in a variety of estimation problems, $n^{-2\beta/(2\beta+1)}$ represents minimax convergence rates of the squared $L_2$-risk for functions from such ellipsoids (e.g., from Sobolev balls of smoothness $\beta$), as in the famous Pinsker's theorem [see, e.g., Tsybakov (2009), Chapter 3].
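The rate claim above can be checked numerically. The following is a minimal sketch, assuming a hypothetical spectrum $\lambda_k=k^{-2\beta}$ truncated at finitely many terms; the function names `gamma_breve` and `fixed_point` are ours and purely illustrative. It evaluates the function $(n^{-1}\sum_k(\lambda_k\wedge\delta^2))^{1/2}$ and solves the fixed point equation by bisection, using the fact that the ratio of this function to $\delta^2$ is strictly decreasing.

```python
import math

def gamma_breve(delta, eigenvalues, n):
    """Mendelson-type function: sqrt(n^{-1} * sum_k min(lambda_k, delta^2))."""
    d2 = delta * delta
    return math.sqrt(sum(min(lam, d2) for lam in eigenvalues) / n)

def fixed_point(eigenvalues, n, iterations=60):
    """Approximate the positive solution of gamma(delta) = delta^2 on (0, 1].

    gamma(delta) / delta^2 is strictly decreasing, so bisection applies.
    The search is done on a logarithmic scale because the root can be small.
    Assumes gamma(1) <= 1, which holds when sum(eigenvalues) <= n.
    """
    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if gamma_breve(mid, eigenvalues, n) > mid * mid:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical polynomially decaying spectrum (an assumption for illustration).
beta = 1.0
spectrum = [k ** (-2.0 * beta) for k in range(1, 20001)]
```

Comparing the squared solutions for two sample sizes on a log-log scale should give a slope close to $-2\beta/(2\beta+1)$, up to the effect of truncation and of the discreteness of the spectrum.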

Example. Sobolev spaces $W^{\alpha,2}(G)$, $G\subset\mathbb{R}^d$, of smoothness $\alpha>d/2$ form a well-known class of concrete examples of RKHS. Let $\mathbb{T}^d$, $d\ge1$, denote the $d$-dimensional torus and let $\Pi$ be the uniform distribution in $\mathbb{T}^d$. It is easy to check that, for all $\alpha>d/2$, the Sobolev space $W^{\alpha,2}(\mathbb{T}^d)$ is an RKHS generated by the kernel $K(x,y)=k(x-y)$, $x,y\in\mathbb{T}^d$, where the function $k\in L_2(\mathbb{T}^d)$ is defined by its Fourier coefficients

$$\hat k_n=(|n|^2+1)^{-\alpha},\quad n=(n_1,\dots,n_d)\in\mathbb{Z}^d,\ |n|^2:=n_1^2+\cdots+n_d^2.$$

In this case, the eigenfunctions of the operator $T_K$ are the functions of the Fourier basis and its eigenvalues are the numbers $\{(|n|^2+1)^{-\alpha}:n\in\mathbb{Z}^d\}$. For $d=1$ and $\alpha>1/2$, we have $\lambda_k\asymp k^{-2\alpha}$ (recall that $\{\lambda_k\}$ are the eigenvalues arranged in decreasing order), so $\beta=\alpha$ and $\bar\delta_n^2\asymp n^{-2\alpha/(2\alpha+1)}$, which is a minimax nonparametric convergence rate for Sobolev balls in $W^{\alpha,2}(\mathbb{T})$ [see, e.g., Tsybakov (2009), Theorem 2.9]. More generally, for arbitrary $d\ge1$ and $\alpha>d/2$, we get $\beta=\alpha/d$ and $\bar\delta_n^2\asymp n^{-2\alpha/(2\alpha+d)}$, which is also a minimax optimal convergence rate in this case. Suppose now that the distribution $\Pi$ is uniform in a torus $\mathbb{T}^{d'}\subset\mathbb{T}^d$ of dimension $d'<d$. We will use the same kernel $K$, but restrict the RKHS $\mathcal{H}_K$ to the torus $\mathbb{T}^{d'}$ of smaller dimension. Let $d''=d-d'$. For $n\in\mathbb{Z}^d$, we will write $n=(n',n'')$ with $n'\in\mathbb{Z}^{d'}$, $n''\in\mathbb{Z}^{d''}$. It is easy to prove that the eigenvalues of the operator $T_K$ become in this case

$$\sum_{n''\in\mathbb{Z}^{d''}}(|n'|^2+|n''|^2+1)^{-\alpha}\asymp(|n'|^2+1)^{-(\alpha-d''/2)}.$$

Due to this fact, the norm of the space $\mathcal{H}_K$ (restricted to $\mathbb{T}^{d'}$) is equivalent to the norm of the Sobolev space $W^{\alpha-d''/2,2}(\mathbb{T}^{d'})$. Since the eigenvalues of the operator $T_K$ coincide, up to a constant, with the numbers $\{(|n'|^2+1)^{-(\alpha-d''/2)}:n'\in\mathbb{Z}^{d'}\}$, we get $\bar\delta_n^2\asymp n^{-(2\alpha-d'')/(2\alpha-d''+d')}$ [which is again the minimax convergence rate for Sobolev balls in $W^{\alpha-d''/2,2}(\mathbb{T}^{d'})$]. In the case of more general design distributions $\Pi$, the rate of decay of the eigenvalues $\lambda_k$ and the corresponding size of the excess risk bound $\bar\delta_n^2$ depends on $\Pi$. If, for instance, $\Pi$ is supported in a submanifold $S\subset\mathbb{T}^d$ of dimension $\dim(S)<d$, the rate of convergence of $\bar\delta_n^2$ to $0$ depends on the dimension of the submanifold $S$ rather than on the dimension of the ambient space $\mathbb{T}^d$.

Using the properties of the function $\breve\gamma_n$, in particular, the fact that $\breve\gamma_n(\delta)/\delta$ is decreasing, it is easy to observe that $\breve\gamma_n(\delta)\le\bar\delta_n\delta+\bar\delta_n^2$, $\delta\in(0,1]$. Moreover, if $\breve\epsilon=\breve\epsilon(K)$ denotes the smallest value of $\epsilon$ such that the linear function $\epsilon\delta+\epsilon^2$, $\delta\in(0,1]$, provides an upper bound for the function $\breve\gamma_n(\delta)$, $\delta\in(0,1]$, then $\breve\epsilon\le\bar\delta_n\le2(\sqrt5-1)^{-1}\breve\epsilon$. Note that $\breve\epsilon$ also depends on $n$, but we do not have to emphasize this dependence in the notation since, in what follows, $n$ is fixed. Based on the observations above, the quantity $\bar\delta_n$ coincides (up to a numerical constant) with the slope $\breve\epsilon$ of the "smallest linear majorant" of the form $\epsilon\delta+\epsilon^2$ of the function $\breve\gamma_n(\delta)$. This interpretation of $\bar\delta_n$ is of some importance in the design of complexity penalties used in this paper.
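The two-sided comparison between the fixed point and the slope of the smallest linear majorant can be illustrated numerically. This is only a sketch under stated assumptions: the spectrum $\lambda_k=k^{-2}$ is hypothetical and truncated, the majorant condition is checked on a finite grid of $\delta$ values (with the fixed point added to the grid) rather than on all of $(0,1]$, and all function names are ours.

```python
import math

def gamma_breve(delta, eigenvalues, n):
    """sqrt(n^{-1} * sum_k min(lambda_k, delta^2))."""
    d2 = delta * delta
    return math.sqrt(sum(min(lam, d2) for lam in eigenvalues) / n)

def fixed_point(eigenvalues, n, iterations=60):
    """Bisection (log scale) for the positive solution of gamma(delta) = delta^2."""
    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if gamma_breve(mid, eigenvalues, n) > mid * mid:
            lo = mid
        else:
            hi = mid
    return hi

def smallest_majorant_slope(eigenvalues, n, grid_size=200):
    """Smallest eps with gamma(delta) <= eps*delta + eps^2 on a delta grid.

    Returns (eps, delta_n). The fixed point delta_n is itself feasible,
    so it serves as the initial upper end of the bisection over eps.
    """
    delta_n = fixed_point(eigenvalues, n)
    deltas = [(i + 1) / grid_size for i in range(grid_size)] + [delta_n]
    gammas = [gamma_breve(d, eigenvalues, n) for d in deltas]

    def is_majorant(eps):
        return all(g <= eps * d + eps * eps for d, g in zip(deltas, gammas))

    lo, hi = 0.0, delta_n
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if is_majorant(mid):
            hi = mid
        else:
            lo = mid
    return hi, delta_n

# Hypothetical truncated spectrum (an assumption for illustration).
spectrum = [k ** (-2.0) for k in range(1, 5001)]
```

The upper constant $2(\sqrt5-1)^{-1}$ comes from evaluating the majorant condition at the fixed point itself, which gives $\delta^2\le\epsilon\delta+\epsilon^2$ and hence $\delta\le\epsilon(1+\sqrt5)/2$.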

1.2. Sparse recovery via regularization. Instead of minimizing the empirical risk over an RKHS-ball [as in problem (1)], it is very common to define the estimator $\hat f$ of the target function $f_*$ as a solution of the penalized empirical risk minimization problem of the form

(2) $$\hat f:=\arg\min_{f\in\mathcal{H}_K}\Big[\frac1n\sum_{i=1}^n\ell(Y_i,f(X_i))+\epsilon\|f\|^\alpha_{\mathcal{H}_K}\Big],$$

where $\epsilon>0$ is a tuning parameter that balances the tradeoff between the empirical risk and the "smoothness" of the estimate and, most often, $\alpha=2$ (sometimes, $\alpha=1$). The properties of the estimator $\hat f$ have been studied extensively. In particular, it was possible to derive probabilistic bounds on the excess risk $\mathcal{E}(\ell\circ\hat f)$ (oracle inequalities) with the control of the random error in terms of the rate of decay of the eigenvalues $\{\lambda_k\}$, or, equivalently, in terms of the function $\breve\gamma_n$ [see, e.g., Blanchard, Bousquet and Massart (2008)].

In recent years, there has been a lot of interest in a data dependent choice of the kernel $K$ in problems of this type. In particular, given a finite (possibly large) dictionary $\{K_j:j=1,2,\dots,N\}$ of symmetric nonnegatively definite kernels on $S$, one can try to find a "good" kernel $K$ as a convex combination of the kernels from the dictionary:

(3) $$K\in\mathcal{K}:=\Big\{\sum_{j=1}^N\theta_jK_j:\theta_j\ge0,\ \theta_1+\cdots+\theta_N=1\Big\}.$$

The coefficients of $K$ need to be estimated from the training data along with the prediction rule. Using this approach for problem (2) with $\alpha=1$ leads to the following optimization problem:

(4) $$\hat f:=\arg\min_{f\in\mathcal{H}_K,\ K\in\mathcal{K}}\big(P_n(\ell\circ f)+\epsilon\|f\|_{\mathcal{H}_K}\big).$$

This learning problem, often referred to as multiple kernel learning, has been studied recently by Bousquet and Herrmann (2003), Crammer, Keshet and Singer (2003), Lanckriet et al. (2004), Micchelli and Pontil (2005), Lin and Zhang (2006), Srebro and Ben-David (2006), Bach (2008) and Koltchinskii and Yuan (2008) among others. In particular [see, e.g., Micchelli and Pontil (2005)], problem (4) is equivalent to the following:



(5) $$(\hat f_1,\dots,\hat f_N):=\arg\min_{f_j\in\mathcal{H}_{K_j},\ j=1,\dots,N}\Big(P_n(\ell\circ(f_1+\cdots+f_N))+\epsilon\sum_{j=1}^N\|f_j\|_{\mathcal{H}_{K_j}}\Big),$$

which is an infinite-dimensional version of LASSO-type penalization. Koltchinskii and Yuan (2008) studied this method in the case when the dictionary is large, but the target function $f_*$ has a "sparse representation" in terms of a relatively small subset of kernels $\{K_j:j\in J\}$. It was shown that this method is adaptive to sparsity, extending well-known properties of LASSO to this infinite-dimensional framework.

In this paper, we study a different approach to multiple kernel learning. It is closer to the recent work on "sparse additive models" [see, e.g., Ravikumar et al. (2008) and Meier, van de Geer and Bühlmann (2009)] and it is based on a "double penalization" with a combination of empirical $L_2$-norms (used to enforce the sparsity of the solution) and RKHS-norms (used to enforce the "smoothness" of the components). Moreover, we suggest a data-driven method of choosing the values of regularization parameters that is adaptive to unknown smoothness of the components (determined by the behavior of distribution dependent eigenvalues of the kernels).

Let $\mathcal{H}_j:=\mathcal{H}_{K_j}$, $j=1,\dots,N$. Denote $\mathcal{H}:=\mathrm{l.s.}\big(\bigcup_{j=1}^N\mathcal{H}_j\big)$ ("l.s." meaning "the linear span"), and

$$\mathcal{H}^{(N)}:=\{(h_1,\dots,h_N):h_j\in\mathcal{H}_j,\ j=1,\dots,N\}.$$

Note that $f\in\mathcal{H}$ if and only if there exists an additive representation (possibly, nonunique) $f=f_1+\cdots+f_N$, where $f_j\in\mathcal{H}_j$, $j=1,\dots,N$. Also, $\mathcal{H}^{(N)}$ has a natural structure of a linear space and it can be equipped with the following inner product:

$$\langle(f_1,\dots,f_N),(g_1,\dots,g_N)\rangle_{\mathcal{H}^{(N)}}:=\sum_{j=1}^N\langle f_j,g_j\rangle_{\mathcal{H}_j}$$

to become the direct sum of Hilbert spaces $\mathcal{H}_j$, $j=1,\dots,N$.

Given a convex subset $D\subset\mathcal{H}^{(N)}$, consider the following penalized empirical risk minimization problem:

(6) $$(\hat f_1,\dots,\hat f_N)=\arg\min_{(f_1,\dots,f_N)\in D}\Big[P_n(\ell\circ(f_1+\cdots+f_N))+\sum_{j=1}^N\big(\epsilon_j\|f_j\|_{L_2(\Pi_n)}+\epsilon_j^2\|f_j\|_{\mathcal{H}_j}\big)\Big].$$

Note that for special choices of set $D$, for instance, for $D:=\{(f_1,\dots,f_N):f_j\in\mathcal{H}_j,\ \|f_j\|_{\mathcal{H}_j}\le R_j\}$ for some $R_j>0$, $j=1,\dots,N$, one can replace each component $f_j$ involved in the optimization problem by its orthogonal projection in $\mathcal{H}_j$ onto the linear span of the functions $\{K_j(\cdot,X_i),i=1,\dots,n\}$ and reduce the problem to a convex optimization over a finite-dimensional space (of dimension $nN$).
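To make the finite-dimensional reduction concrete, the following sketch evaluates an objective of the type used in problem (6) when each component is represented as $f_j=\sum_ic_{j,i}K_j(\cdot,X_i)$. It is an illustration under assumptions, not the estimation procedure of the paper: the quadratic loss is our choice, the function name `double_penalty_objective` is hypothetical, and no optimization is performed. With the Gram matrix $G_j=(K_j(X_i,X_k))_{i,k}$, the values of $f_j$ at the sample are $G_jc_j$ and the squared RKHS norm is $c_j^\top G_jc_j$.

```python
import numpy as np

def double_penalty_objective(coefs, grams, y, eps):
    """Empirical risk plus the double penalty, for the quadratic loss (assumed).

    coefs[j] : coefficient vector c_j of length n, f_j = sum_i c_j[i] K_j(., X_i)
    grams[j] : n x n Gram matrix (K_j(X_i, X_k)), not normalized by n
    eps[j]   : regularization parameter for the j-th component
    """
    n = len(y)
    fitted = np.zeros(n)
    penalty = 0.0
    for c, G, e in zip(coefs, grams, eps):
        values = G @ c                                  # f_j(X_1), ..., f_j(X_n)
        fitted += values
        emp_l2 = np.sqrt(np.mean(values ** 2))          # empirical L2 norm of f_j
        rkhs = np.sqrt(max(float(c @ G @ c), 0.0))      # RKHS norm of f_j
        penalty += e * emp_l2 + e ** 2 * rkhs
    return float(np.mean((y - fitted) ** 2) + penalty)
```

Both penalty terms are norms of $c_j$ (or seminorms, if $G_j$ is singular), so for a convex loss the objective is convex in the stacked coefficient vector of dimension $nN$.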

The complexity penalty in the problem (6) is based on two norms of the components $f_j$ of an additive representation: the empirical $L_2$-norm, $\|f_j\|_{L_2(\Pi_n)}$, with regularization parameter $\epsilon_j$, and an RKHS-norm, $\|f_j\|_{\mathcal{H}_j}$, with regularization parameter $\epsilon_j^2$. The empirical $L_2$-norm (the lighter norm) is used to enforce the sparsity of the solution whereas the RKHS norms (the heavier norms) are used to enforce the "smoothness" of the components. This is similar to the approach taken in Meier, van de Geer and Bühlmann (2009) in the context of classical additive models, that is, in the case when $S:=[0,1]^N$, $\mathcal{H}_j:=W^{\alpha,2}([0,1])$ for some smoothness $\alpha>1/2$ and the space $\mathcal{H}_j$ is a space of functions depending on the $j$th variable. In this case, the regularization parameters $\epsilon_j$ are equal (up to a constant) to $n^{-\alpha/(2\alpha+1)}$. The quantity $\epsilon_j^2$, used in the "smoothness part" of the penalty, coincides with the minimax convergence rate in a one component smooth problem. At the same time, the quantity $\epsilon_j$, used in the "sparsity part" of the penalty, is equal to the square root of the minimax rate (which is similar to the choice of regularization parameter in standard sparse recovery methods such as LASSO). This choice of regularization parameters results in the excess risk of the order $dn^{-2\alpha/(2\alpha+1)}$, where $d$ is the number of components of the target function (the degree of sparsity of the problem).

The framework of multiple kernel learning considered in this paper includes many generalized versions of classical additive models. For instance, one can think of the case when $S:=[0,1]^{m_1}\times\cdots\times[0,1]^{m_N}$ and $\mathcal{H}_j=W^{\alpha,2}([0,1]^{m_j})$ is a space of functions depending on the $j$th block of variables. In this case, a proper choice of regularization parameters (for uniform design distribution) would be $\epsilon_j=n^{-\alpha/(2\alpha+m_j)}$, $j=1,\dots,N$ (so, these parameters and the error rates for different components of the model are different). It should be also clear from the discussion in Section 1.1 that, if the design distribution $\Pi$ is unknown, the minimax convergence rates for the one component problems are also unknown. For instance, if the projections of design points on the cubes $[0,1]^{m_j}$ are distributed in lower-dimensional submanifolds of these cubes, then the unknown dimensions of the submanifolds rather than the dimensions $m_j$ would be involved in the minimax rates and in the regularization parameters $\epsilon_j$. Because of this, data driven choice of regularization parameters $\epsilon_j$ that provides adaptation to the unknown design distribution $\Pi$ and to the unknown "smoothness" of the components (related to this distribution) is a major issue in multiple kernel learning. From this point of view, even in the case of classical additive models, the choice of regularization parameters that is based only on Sobolev type smoothness and ignores the design distribution is not adaptive. Note that, in the infinite-dimensional LASSO studied in Koltchinskii and Yuan (2008), the regularization parameter $\epsilon$ is chosen the same way as in the classical LASSO ($\epsilon\asymp\sqrt{\log N/n}$), so, it is not related to the smoothness of the components. However, the oracle inequalities proved in Koltchinskii and Yuan (2008) give correct size of the excess risk only for special choices of kernels that depend on unknown "smoothness" of the components of the target function $f_*$, so, this method is not adaptive either.

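1.3. Adaptive choice of regularization parameters. Denote

$$\hat K_j:=\Big(\frac{K_j(X_i,X_k)}{n}\Big)_{i,k=1,\dots,n}.$$

This $n\times n$ Gram matrix can be viewed as an empirical version of the integral operator $T_{K_j}$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K_j$. Denote by $\hat\lambda_k^{(j)}$, $k=1,2,\dots$, the eigenvalues of $\hat K_j$ arranged in decreasing order. We also use the notation $\lambda_k^{(j)}$, $k=1,2,\dots$, for the eigenvalues of the operator $T_{K_j}:L_2(\Pi)\mapsto L_2(\Pi)$ with kernel $K_j$ arranged in decreasing order. Define functions $\breve\gamma_n^{(j)}$, $\hat\gamma_n^{(j)}$,

$$\breve\gamma_n^{(j)}(\delta):=\Big(\frac1n\sum_{k=1}^\infty(\lambda_k^{(j)}\wedge\delta^2)\Big)^{1/2}\quad\text{and}\quad\hat\gamma_n^{(j)}(\delta):=\Big(\frac1n\sum_{k=1}^n(\hat\lambda_k^{(j)}\wedge\delta^2)\Big)^{1/2},$$

and, for a fixed given $A\ge1$, let

(7) $$\hat\epsilon_j:=\inf\Big\{\epsilon\ge\sqrt{\frac{A\log N}{n}}:\hat\gamma_n^{(j)}(\delta)\le\epsilon\delta+\epsilon^2,\ \forall\delta\in(0,1]\Big\}.$$

One can view $\hat\epsilon_j$ as an empirical estimate of the quantity $\breve\epsilon_j=\breve\epsilon(K_j)$ that (as we have already pointed out) plays a crucial role in the bounds on the excess risk in empirical risk minimization problems in the RKHS context. In fact, since most often $\breve\epsilon_j\ge\sqrt{A\log N/n}$, we will redefine this quantity as

(8) $$\breve\epsilon_j:=\inf\Big\{\epsilon\ge\sqrt{\frac{A\log N}{n}}:\breve\gamma_n^{(j)}(\delta)\le\epsilon\delta+\epsilon^2,\ \forall\delta\in(0,1]\Big\}.$$

We will use the following values of regularization parameters in problem (6): $\epsilon_j=\tau\hat\epsilon_j$, where $\tau$ is a sufficiently large constant.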
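A data-driven parameter of the kind defined in (7) can be approximated from the eigenvalues of the normalized Gram matrix. The sketch below is a hedged illustration: the Gaussian kernel, its bandwidth, the grid over $\delta$ (which replaces the condition for all $\delta\in(0,1]$) and the function names are all our assumptions. It computes the eigenvalues with `numpy.linalg.eigvalsh`, forms the empirical function $(n^{-1}\sum_k(\hat\lambda_k\wedge\delta^2))^{1/2}$ and finds, by bisection, the smallest $\epsilon$ not below $\sqrt{A\log N/n}$ for which $\epsilon\delta+\epsilon^2$ dominates it on the grid.

```python
import numpy as np

def gamma_hat(delta, eigs, n):
    """Empirical function sqrt(n^{-1} * sum_k min(lambda_hat_k, delta^2))."""
    return float(np.sqrt(np.sum(np.minimum(eigs, delta ** 2)) / n))

def is_majorant(eps, deltas, gammas):
    """Check gamma_hat(delta) <= eps * delta + eps^2 on the delta grid."""
    return bool(np.all(gammas <= eps * deltas + eps ** 2))

def eps_hat(gram, A, N, grid_size=400):
    """Grid approximation of the data-driven parameter.

    gram is the unnormalized n x n matrix (K(X_i, X_k)); it is divided by n
    before taking eigenvalues.
    """
    n = gram.shape[0]
    eigs = np.clip(np.linalg.eigvalsh(gram / n), 0.0, None)
    floor = float(np.sqrt(A * np.log(N) / n))
    deltas = np.linspace(1.0 / grid_size, 1.0, grid_size)
    gammas = np.array([gamma_hat(d, eigs, n) for d in deltas])
    if is_majorant(floor, deltas, gammas):
        return floor
    lo, hi = floor, 1.0
    while not is_majorant(hi, deltas, gammas):
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if is_majorant(mid, deltas, gammas):
            hi = mid
        else:
            lo = mid
    return float(hi)

def gaussian_gram(x, bandwidth):
    """Gram matrix of a Gaussian kernel on the real line (hypothetical choice)."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))
```

Since the majorant condition is monotone in $\epsilon$, the result is the maximum of the lower limit $\sqrt{A\log N/n}$ and the unconstrained smallest slope, so it can only grow with $N$.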

It should be emphasized that the structure of complexity penalty and the choice of regularization parameters in (6) are closely related to the following bound on Rademacher processes indexed by functions from an RKHS $\mathcal{H}_K$: with a high probability, for all $h\in\mathcal{H}_K$,

$$|R_n(h)|\le C\big[\breve\epsilon(K)\|h\|_{L_2(\Pi)}+\breve\epsilon^2(K)\|h\|_{\mathcal{H}_K}\big].$$

Such bounds follow from the results of Section 3 and they provide a way to prove sparsity oracle inequalities for the estimators (6). The Rademacher process is defined as

$$R_n(f):=n^{-1}\sum_{j=1}^n\varepsilon_jf(X_j),$$

where $\{\varepsilon_j\}$ is a sequence of i.i.d. Rademacher random variables (taking values $+1$ and $-1$ with probability $1/2$ each) independent of $\{X_j\}$.

We will use several basic facts of the empirical processes theory throughout the paper. They include symmetrization inequalities and contraction (comparison) inequalities for Rademacher processes that can be found in the books of Ledoux and Talagrand (1991) and van der Vaart and Wellner (1996). We also use Talagrand's concentration inequality for empirical processes [see Talagrand (1996), Bousquet (2002)].

The main goal of the paper is to establish oracle inequalities for the excess risk of the estimator $\hat f=\hat f_1+\cdots+\hat f_N$. In these inequalities, the excess risk of $\hat f$ is compared with the excess risk of an oracle $f:=f_1+\cdots+f_N$, $(f_1,\dots,f_N)\in D$, with an error term depending on the degree of sparsity of the oracle, that is, on the number of nonzero components $f_j\in\mathcal{H}_j$ in its additive representation. The oracle inequalities will be stated in the next section. Their proof relies on probabilistic bounds for empirical $L_2$-norms and data dependent regularization parameters $\hat\epsilon_j$. The results of Section 3 show that they can be bounded by their respective population counterparts. Using these tools and some bounds on empirical processes derived in Section 5, we prove in Section 4 the oracle inequalities for the estimator $\hat f$.

2. Oracle inequalities. Considering the problem in the case when the domain $D$ of (6) is not bounded, say, $D=\mathcal{H}^{(N)}$, leads to additional technical complications and might require some changes in the estimation procedure. To avoid this, we assume below that $D$ is a bounded convex subset of $\mathcal{H}^{(N)}$. It will be also assumed that, for all $j=1,\dots,N$, $\sup_{x\in S}K_j(x,x)\le1$, which, by elementary properties of RKHS, implies that $\|f_j\|_{L_\infty}\le\|f_j\|_{\mathcal{H}_j}$, $j=1,\dots,N$. Because of this,

$$R_D:=\sup_{(f_1,\dots,f_N)\in D}\|f_1+\cdots+f_N\|_{L_\infty}<+\infty.$$

Denote $R_D^*:=R_D\vee\|f_*\|_{L_\infty}$. We will allow the constants involved in the oracle inequalities stated and proved below to depend on the value of $R_D^*$ (so, implicitly, it is assumed that this value is not too large).

We shall also assume that $N$ is large enough, say, so that $\log N\ge2\log\log n$. This assumption is not essential to our development and is in place to avoid an extra term of the order $n^{-1}\log\log n$ in our risk bounds.

2.1. Loss functions of quadratic type. We will formulate the assumptions on the loss function $\ell$. The main assumption is that, for all $y\in T$, $\ell(y,\cdot)$ is a nonnegative convex function. In addition, we will assume that $\ell(y,0)$, $y\in T$, is uniformly bounded from above by a numerical constant. Moreover, suppose that, for all $y\in T$, $\ell(y,\cdot)$ is twice continuously differentiable and its first and second derivatives are uniformly bounded in $T\times[-R_D^*,R_D^*]$. Denote

(9) $$m(R):=\frac12\inf_{y\in T}\inf_{|u|\le R}\frac{\partial^2\ell(y,u)}{\partial u^2},\qquad M(R):=\frac12\sup_{y\in T}\sup_{|u|\le R}\frac{\partial^2\ell(y,u)}{\partial u^2}$$

and let $m_*:=m(R_D^*)$, $M_*:=M(R_D^*)$. We will assume that $m_*>0$. Denote

$$L_*:=\sup_{|u|\le R_D^*,\ y\in T}\Big|\frac{\partial\ell}{\partial u}(y,u)\Big|.$$

Clearly, for all $y\in T$, the function $\ell(y,\cdot)$ satisfies Lipschitz condition with constant $L_*$.

The constants $m_*$, $M_*$, $L_*$ will appear in a number of places in what follows. Without loss of generality, we can also assume that $m_*\le1$ and $L_*\ge1$ (otherwise, $m_*$ and $L_*$ can be replaced by a lower bound and an upper bound, resp.).

The loss functions satisfying the assumptions stated above will be called the losses of quadratic type.

If $\ell$ is a loss of quadratic type and $f=f_1+\cdots+f_N$, $(f_1,\dots,f_N)\in D$, then

(10) $$m_*\|f-f_*\|^2_{L_2(\Pi)}\le\mathcal{E}(\ell\circ f)\le M_*\|f-f_*\|^2_{L_2(\Pi)}.$$

This bound easily follows from a simple argument based on Taylor expansion and it will be used later in the paper. If $\mathcal{H}$ is dense in $L_2(\Pi)$, then (10) implies that

(11) $$\inf_{f\in\mathcal{H}}P(\ell\circ f)=\inf_{f\in L_2(\Pi)}P(\ell\circ f)=P(\ell\circ f_*).$$

The quadratic loss $\ell(y,u):=(y-u)^2$ in the case when $T\subset\mathbb{R}$ is a bounded set is one of the main examples of such loss functions. In this case, $m(R)=1$ for all $R>0$. In regression problems with a bounded response variable, more general loss functions of the form $\ell(y,u):=\phi(y-u)$ can be also used, where $\phi$ is an even nonnegative convex twice continuously differentiable function with $\phi''$ uniformly bounded in $\mathbb{R}$, $\phi(0)=0$ and $\phi''(u)>0$, $u\in\mathbb{R}$. In classification problems, the loss functions of the form $\ell(y,u)=\phi(yu)$ are commonly used, with $\phi$ being a nonnegative decreasing convex twice continuously differentiable function such that, again, $\phi''$ is uniformly bounded in $\mathbb{R}$ and $\phi''(u)>0$, $u\in\mathbb{R}$. The loss function $\phi(u)=\log_2(1+e^{-u})$ (often referred to as the logit loss) is a specific example.
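As a small worked illustration of the constants attached to a loss of quadratic type, the sketch below evaluates half the infimum and half the supremum of the second derivative in $u$ over $|u|\le R$ on a grid. The helper names are ours, and the grid only approximates the infimum and supremum. For the logit loss with labels $y\in\{-1,+1\}$ (an assumption about $T$), the second derivative of $\log_2(1+e^{-yu})$ in $u$ equals $\sigma(u)(1-\sigma(u))/\ln2$ with $\sigma$ the logistic function, and does not depend on $y$.

```python
import math

def sigmoid(u):
    """Logistic function 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def logit_second_derivative(u):
    """Second derivative in u of log2(1 + exp(-y*u)) for y in {-1, +1}."""
    s = sigmoid(u)
    return s * (1.0 - s) / math.log(2.0)

def quadratic_type_constants(second_derivative, R, grid=2001):
    """Grid approximation of (m(R), M(R)): half the min and half the max
    of the second derivative over |u| <= R."""
    us = [-R + 2.0 * R * i / (grid - 1) for i in range(grid)]
    vals = [second_derivative(u) for u in us]
    return 0.5 * min(vals), 0.5 * max(vals)

# Logit loss on [-2, 2], and the quadratic loss (y - u)^2 for comparison.
m_logit, M_logit = quadratic_type_constants(logit_second_derivative, 2.0)
m_quad, M_quad = quadratic_type_constants(lambda u: 2.0, 2.0)
```

For the logit loss the lower constant decays exponentially in $R$, which is one way to see why the bounds are stated with constants depending on the uniform bound on the functions involved.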

2.2. Geometry of the dictionary. Now we introduce several important geometric characteristics of dictionaries consisting of kernels (or, equivalently, of RKHS). These characteristics are related to the degree of "dependence" of spaces of random variables $\mathcal{H}_j\subset L_2(\Pi)$, $j=1,\dots,N$, and they will be involved in the oracle inequalities for the excess risk $\mathcal{E}(\ell\circ\hat f)$.

First, for $J\subset\{1,\dots,N\}$ and $b\in[0,+\infty]$, denote

$$C_J^{(b)}:=\Big\{(h_1,\dots,h_N)\in\mathcal{H}^{(N)}:\sum_{j\notin J}\|h_j\|_{L_2(\Pi)}\le b\sum_{j\in J}\|h_j\|_{L_2(\Pi)}\Big\}.$$

Clearly, the set $C_J^{(b)}$ is a cone in the space $\mathcal{H}^{(N)}$ that consists of vectors $(h_1,\dots,h_N)$ whose components corresponding to $j\in J$ "dominate" the rest of the components. This family of cones increases as $b$ increases. For $b=0$, $C_J^{(b)}$ coincides with the linear subspace of vectors for which $h_j=0$, $j\notin J$. For $b=+\infty$, $C_J^{(b)}$ is the whole space $\mathcal{H}^{(N)}$.

The following quantity will play the most important role:

$$\beta_{2,b}(J;\Pi):=\beta_{2,b}(J):=\inf\Big\{\beta>0:\Big(\sum_{j\in J}\|h_j\|^2_{L_2(\Pi)}\Big)^{1/2}\le\beta\Big\|\sum_{j=1}^Nh_j\Big\|_{L_2(\Pi)},\ (h_1,\dots,h_N)\in C_J^{(b)}\Big\}.$$

Clearly, $\beta_{2,b}(J;\Pi)$ is a nondecreasing function of $b$. In the case of a "simple dictionary" that consists of one-dimensional spaces, similar quantities have been used in the literature on sparse recovery [see, e.g., Koltchinskii (2008, 2009a, 2009b, 2009c); Bickel, Ritov and Tsybakov (2009)].

The quantity $\beta_{2,b}(J;\Pi)$ can be upper bounded in terms of some other geometric characteristics that describe how "dependent" the spaces of random variables $\mathcal{H}_j\subset L_2(\Pi)$ are. These characteristics will be introduced below.

Given $h_j\in\mathcal{H}_j$, $j=1,\dots,N$, denote by $\kappa(\{h_j:j\in J\})$ the minimal eigenvalue of the Gram matrix $(\langle h_j,h_k\rangle_{L_2(\Pi)})_{j,k\in J}$. Let

(12) $$\kappa(J):=\inf\{\kappa(\{h_j:j\in J\}):h_j\in\mathcal{H}_j,\ \|h_j\|_{L_2(\Pi)}=1\}.$$

We will also use the notation

(13) $$\mathcal{H}_J=\mathrm{l.s.}\Big(\bigcup_{j\in J}\mathcal{H}_j\Big).$$

The following quantity is the maximal cosine of the angle in the space $L_2(\Pi)$ between the vectors in the subspaces $\mathcal{H}_I$ and $\mathcal{H}_J$ for some $I,J\subset\{1,\dots,N\}$:

(14) $$\rho(I,J):=\sup\Big\{\frac{\langle f,g\rangle_{L_2(\Pi)}}{\|f\|_{L_2(\Pi)}\|g\|_{L_2(\Pi)}}:f\in\mathcal{H}_I,\ g\in\mathcal{H}_J,\ f\ne0,\ g\ne0\Big\}.$$

Denote $\rho(J):=\rho(J,J^c)$. The quantities $\rho(I,J)$ and $\rho(J)$ are very similar to the notion of canonical correlation in multivariate statistical analysis.
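For finite-dimensional subspaces, the maximal cosine of the angle can be computed with standard linear algebra, which may help to build intuition for these quantities. The sketch below is an illustration under assumptions and not part of the paper: functions are represented by their values at $n$ sample points (so the empirical inner product replaces the one in $L_2(\Pi)$), the spanning families are assumed to have full column rank, and the names `max_cosine` and `min_gram_eigenvalue` are ours. The second helper evaluates the minimal Gram eigenvalue for one given choice of normalized functions; the quantity (12) is an infimum over all such choices and is not computed here.

```python
import numpy as np

def max_cosine(F, G):
    """Largest cosine of the angle between span(F) and span(G).

    Columns of F and G hold function values at the sample points. The factor
    1/n of the empirical inner product cancels in the cosine, so Euclidean
    orthonormal bases (from QR) can be used. The largest singular value of
    Qf^T Qg is the cosine of the smallest principal angle.
    """
    Qf, _ = np.linalg.qr(F)
    Qg, _ = np.linalg.qr(G)
    s = np.linalg.svd(Qf.T @ Qg, compute_uv=False)
    return float(min(s[0], 1.0))

def min_gram_eigenvalue(H):
    """Minimal eigenvalue of the empirical Gram matrix of the columns of H,
    after normalizing each column to unit empirical L2 norm."""
    n = H.shape[0]
    Hn = H / np.sqrt(np.mean(H ** 2, axis=0))
    gram = Hn.T @ Hn / n
    return float(np.linalg.eigvalsh(gram)[0])
```

A value of `max_cosine` close to 1 indicates nearly collinear directions in the two subspaces, which is the situation the geometric conditions are meant to exclude.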

There are other important geometric characteristics, frequently used in the theory of sparse recovery, including so called "restricted isometry constants" by Candès and Tao (2007). Define $\delta_d(\Pi)$ to be the smallest $\delta>0$ such that for all $(h_1,\dots,h_N)\in\mathcal{H}^{(N)}$ and all $J\subset\{1,\dots,N\}$ with $\mathrm{card}(J)=d$,

$$(1-\delta)\Big(\sum_{j\in J}\|h_j\|^2_{L_2(\Pi)}\Big)^{1/2}\le\Big\|\sum_{j\in J}h_j\Big\|_{L_2(\Pi)}\le(1+\delta)\Big(\sum_{j\in J}\|h_j\|^2_{L_2(\Pi)}\Big)^{1/2}.$$

This condition with a sufficiently small value of $\delta_d(\Pi)$ means that for all choices of $J$ with $\mathrm{card}(J)=d$ the functions in the spaces $\mathcal{H}_j$, $j\in J$, are "almost orthogonal" in $L_2(\Pi)$.

The following simple proposition easily follows from some statements in Koltchinskii (2008, 2009a, 2009b) (where the case of simple dictionaries consisting of one-dimensional spaces $\mathcal{H}_j$ was considered).



Proposition 1. For all $J\subset\{1,\dots,N\}$,

$$\beta_{2,\infty}(J;\Pi)\le\frac{1}{\sqrt{\kappa(J)(1-\rho^2(J))}}.$$

Also, if $\mathrm{card}(J)=d$ and $\delta_{3d}(\Pi)\le\frac{1}{8b}$, then $\beta_{2,b}(J;\Pi)\le4$.

Thus, such quantities as $\beta_{2,\infty}(J;\Pi)$ or $\beta_{2,b}(J;\Pi)$, for finite values of $b$, are reasonably small provided that the spaces of random variables $\mathcal{H}_j$, $j=1,\dots,N$, satisfy proper conditions of "weakness of correlations."

2.3. Excess risk bounds. We are now in a position to formulate our main theorems that provide oracle inequalities for the excess risk $\mathcal{E}(\ell\circ\hat f)$. In these theorems, $\mathcal{E}(\ell\circ\hat f)$ will be compared with the excess risk $\mathcal{E}(\ell\circ f)$ of an oracle $(f_1,\dots,f_N)\in D$. Here and in what follows, $f:=f_1+\cdots+f_N\in\mathcal{H}$. This is a little abuse of notation: we are ignoring the fact that such an additive representation of a function $f\in\mathcal{H}$ is not necessarily unique. In some sense, $f$ denotes both the vector $(f_1,\dots,f_N)\in\mathcal{H}^{(N)}$ and the function $f_1+\cdots+f_N\in\mathcal{H}$. However, this is not going to cause a confusion in what follows. We will also use the following notation:

$$J_f:=\{1\le j\le N:f_j\ne0\}\quad\text{and}\quad d(f):=\mathrm{card}(J_f).$$

The error terms of the oracle inequalities will depend on the quantities $\breve\epsilon_j=\breve\epsilon(K_j)$ related to the "smoothness" properties of the RKHS and also on the geometric characteristics of the dictionary introduced above. In the first theorem, we will use the quantity $\beta_{2,\infty}(J_f;\Pi)$ to characterize the properties of the dictionary. In this case, there will be no assumptions on the quantities $\breve\epsilon_j$: these quantities could be of different order for different kernel machines, so, different components of the additive representation could have different "smoothness." In the second theorem, we will use a smaller quantity $\beta_{2,b}(J_f;\Pi)$ for a proper choice of parameter $b<\infty$. In this case, we will have to make an additional assumption that $\breve\epsilon_j$, $j=1,\dots,N$, are all of the same order (up to a constant).

In both cases, we consider penalized empirical risk minimization problem (6) with data-dependent regularization parameters $\epsilon_j=\tau\hat\epsilon_j$, where $\hat\epsilon_j$, $j=1,\dots,N$, are defined by (7) with some $A\ge4$ and $\tau\ge BL_*$ for a numerical constant $B$.



Theorem 2. There exist numerical constants $C_1,C_2>0$ such that, for all oracles $(f_1,\dots,f_N)\in D$, with probability at least $1-3N^{-A/2}$,

(15) $$\mathcal{E}(\ell\circ\hat f)+C_1\Big(\tau\sum_{j=1}^N\breve\epsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\sum_{j=1}^N\breve\epsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Big)\le2\mathcal{E}(\ell\circ f)+C_2\tau^2\sum_{j\in J_f}\breve\epsilon_j^2\Big(\frac{\beta^2_{2,\infty}(J_f;\Pi)}{m_*}+\|f_j\|_{\mathcal{H}_j}\Big).$$

This result means that if there exists an oracle $(f_1,\dots,f_N)\in D$ such that:

(a) the excess risk $\mathcal{E}(\ell\circ f)$ is small;

(b) the spaces $\mathcal{H}_j$, $j\in J_f$, are not strongly correlated with the spaces $\mathcal{H}_j$, $j\notin J_f$;

(c) $\mathcal{H}_j$, $j\in J_f$, are "well posed" in the sense that $\kappa(J_f)$ is not too small;

(d) $\|f_j\|_{\mathcal{H}_j}$, $j\in J_f$, are all bounded by a reasonable constant,

then the excess risk $\mathcal{E}(\ell\circ\hat f)$ is essentially controlled by $\sum_{j\in J_f}\breve\epsilon_j^2$. At the same time, the oracle inequality provides a bound on the $L_2(\Pi)$-distances between the estimated components $\hat f_j$ and the components of the oracle (of course, everything is under the assumption that the loss is of quadratic type and $m_*$ is bounded away from $0$).

Note also that the constant $2$ in front of the excess risk of the oracle $\mathcal{E}(\ell\circ f)$ can be replaced by $1+\delta$ for any $\delta>0$ with minor modifications of the proof (in this case, the constant $C_2$ depends on $\delta$ and is of the order $1/\delta$).

Suppose now that there exists $\breve\epsilon>0$ and a constant $\Lambda>0$ such that

$$\Lambda^{-1}\le\frac{\breve\epsilon_j}{\breve\epsilon}\le\Lambda,\quad j=1,\dots,N.$$

Theorem 3. There exist numerical constants $C_1,C_2,b>0$ such that, for all oracles $(f_1,\dots,f_N)\in D$, with probability at least $1-3N^{-A/2}$,

(16) $$\mathcal{E}(\ell\circ\hat f)+C_1\Big(\tau\breve\epsilon\sum_{j=1}^N\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\breve\epsilon^2\sum_{j=1}^N\|\hat f_j\|_{\mathcal{H}_j}\Big)\le2\mathcal{E}(\ell\circ f)+C_2\tau^2\Lambda\breve\epsilon^2\Big(\frac{\beta^2_{2,b\Lambda^2}(J_f;\Pi)}{m_*}d(f)+\sum_{j\in J_f}\|f_j\|_{\mathcal{H}_j}\Big).$$

As before, the constant $2$ in the upper bound can be replaced by $1+\delta$, but, in this case, the constants $C_2$ and $b$ would be of the order $1/\delta$. The meaning of this result is that if there exists an oracle $(f_1,\dots,f_N)\in D$ such that:

(a) the excess risk $\mathcal{E}(\ell\circ f)$ is small;

(b) the "restricted isometry" constant $\delta_{3d}(\Pi)$ is small for $d=d(f)$;

(c) $\|f_j\|_{\mathcal{H}_j}$, $j\in J_f$, are all bounded by a reasonable constant,

then the excess risk $\mathcal{E}(\ell\circ\hat f)$ is essentially controlled by $d(f)\breve\epsilon^2$. At the same time, the distance $\sum_{j=1}^N\|\hat f_j-f_j\|_{L_2(\Pi)}$ between the estimator and the oracle is controlled by $d(f)\breve\epsilon$. In particular, this implies that the empirical solution $(\hat f_1,\dots,\hat f_N)$ is "approximately sparse" in the sense that $\sum_{j\notin J_f}\|\hat f_j\|_{L_2(\Pi)}$ is of the order $d(f)\breve\epsilon$.

Remarks. 1. It is easy to check that Theorems 2 and 3 hold also if one replaces $N$ in the definitions (7) of $\hat\epsilon_j$ and (8) of $\breve\epsilon_j$ by an arbitrary $\bar N\ge N$ such that $\log\bar N\ge2\log\log n$ (a similar condition on $N$ introduced early in Section 2 is not needed here). In this case, the probability bounds in the theorems become $1-3\bar N^{-A/2}$. This change might be of interest if one uses the results for a dictionary consisting of just one RKHS ($N=1$), which is not the focus of this paper.

2. If the distribution dependent quantities €j,j = l,...,N are known and 
used as regularization parameters in (6), the oracle inequalities of Theo- 
rems 2 and 3 also hold (with obvious simplifications of their proofs). For 
instance, in the case when S = [0, 1]^, the design distribution II is uniform 
and, for each j = 1, . . . , N, T~Lj is a Sobolev space of functions of smoothness 
a > 1/2 depending only on the jth variable, we have ej >c n~ a l( 2a+l "> . Taking 
in this case 



would lead to oracle inequalities for sparse additive models in the spirit of Meier, van de Geer and Bühlmann (2009). More precisely, if $\mathcal{H}_j:=\{h\in W^{\alpha,2}[0,1]:\int_0^1 h(x)\,dx=0\}$, then, for uniform distribution $\Pi$, the spaces $\mathcal{H}_j$ are orthogonal in $L_2(\Pi)$ (recall that $\mathcal{H}_j$ is viewed as a space of functions depending on the $j$th coordinate). Assume, for simplicity, that $\ell$ is the quadratic loss and that the regression function $f_*$ can be represented as $f_*=\sum_{j\in J}f_{*,j}$, where $J$ is a subset of $\{1,\dots,N\}$ of cardinality $d$ and $\|f_{*,j}\|_{\mathcal{H}_j}\le 1$.

Then it easily follows from the bound of Theorem 3 that with probability at least $1-3N^{-A/2}$

$$\mathcal{E}(\ell\circ\hat f)=\|\hat f-f_*\|_{L_2(\Pi)}^2\le C\tau^2 d\Bigl(n^{-2\alpha/(2\alpha+1)}+\frac{A\log N}{n}\Bigr).$$

Note that, up to a constant, this essentially coincides with the minimax 
lower bound in this type of problems obtained recently by Raskutti, Wain- 
wright and Yu (2009). Of course, if the design distribution is not necessarily 
uniform, an adaptive choice of regularization parameters might be needed 
even in such simple examples and the approach described above leads to 
minimax optimal rates. 
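
To get a feel for the two regimes in the bound of Remark 2, the following minimal sketch evaluates the rate $d\,(n^{-2\alpha/(2\alpha+1)}+A\log N/n)$ numerically. The function name and the default values of `A`, `tau` and the constant `C` are illustrative assumptions; the paper only asserts the bound up to an unspecified numerical constant.

```python
import math

def sparse_additive_rate(n, d, alpha, N, A=4.0, tau=1.0, C=1.0):
    """Evaluate C * tau^2 * d * (n^{-2a/(2a+1)} + A log N / n).

    C, tau and A are placeholders: the text only states the bound
    up to a numerical constant.
    """
    nonparametric = n ** (-2.0 * alpha / (2.0 * alpha + 1.0))  # estimation of d smooth components
    selection = A * math.log(N) / n                            # price of searching among N kernels
    return C * tau ** 2 * d * (nonparametric + selection)

for n in (10**3, 10**4, 10**5):
    print(n, sparse_additive_rate(n, d=5, alpha=2.0, N=1000))
```

The bound is linear in the sparsity $d$ and depends on the dictionary size $N$ only through $\log N$, which is what makes the problem tractable when $N$ is much larger than $n$.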

3. Preliminary bounds. In this section, the case of a single RKHS $\mathcal{H}_K$ associated with a kernel $K$ is considered. We assume that $K(x,x)\le 1$, $x\in S$. This implies that, for all $h\in\mathcal{H}_K$, $\|h\|_{L_2(\Pi)}\le\|h\|_{L_\infty}\le\|h\|_{\mathcal{H}_K}$.

3.1. Comparison of $\|\cdot\|_{L_2(\Pi_n)}$ and $\|\cdot\|_{L_2(\Pi)}$. First, we study the relationship between the empirical and the population $L_2$ norms for functions in $\mathcal{H}_K$.

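Theorem 4. Assume that $A\ge 1$ and $\log N\ge 2\log\log n$. Then there exists a numerical constant $C>0$ such that with probability at least $1-N^{-A}$ for all $h\in\mathcal{H}_K$

$$\|h\|_{L_2(\Pi)}\le C\bigl(\|h\|_{L_2(\Pi_n)}+\breve\varepsilon\|h\|_{\mathcal{H}_K}\bigr), \tag{17}$$

$$\|h\|_{L_2(\Pi_n)}\le C\bigl(\|h\|_{L_2(\Pi)}+\breve\varepsilon\|h\|_{\mathcal{H}_K}\bigr), \tag{18}$$

where

$$\breve\varepsilon=\breve\varepsilon(K):=\inf\Bigl\{\varepsilon\ge\sqrt{\frac{A\log N}{n}}:\ \mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\delta}}|R_n(h)|\le\varepsilon\delta+\varepsilon^2,\ \forall\delta\in(0,1]\Bigr\}. \tag{19}$$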



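Proof. Observe that the inequalities hold trivially when $h=0$. We shall therefore consider only the case when $h\neq 0$. By symmetrization inequality,

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 2\,\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h^2)| \tag{20}$$

and, by contraction inequality, we further have

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 8\,\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|. \tag{21}$$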

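The definition of $\breve\varepsilon$ implies that

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 8\,\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|\le 8\bigl(\breve\varepsilon 2^{-j+1}+\breve\varepsilon^2\bigr). \tag{22}$$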

An application of Talagrand's concentration inequality yields

$$\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 2\Bigl(\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|+2^{-j+1}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr)$$

$$\le 32\Bigl(\breve\varepsilon 2^{-j}+\breve\varepsilon^2+2^{-j}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr)$$

with probability at least $1-\exp(-t-2\log j)$ for any natural number $j$. Now, by the union bound, for all $j$ such that $2\log j\le t$,

$$\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 32\Bigl(\breve\varepsilon 2^{-j}+\breve\varepsilon^2+2^{-j}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr) \tag{23}$$

with probability at least

$$1-\sum_{j:\,2\log j\le t}\exp(-t-2\log j)=1-\exp(-t)\sum_{j:\,2\log j\le t}j^{-2}\ge 1-2\exp(-t). \tag{24}$$

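Recall that $\breve\varepsilon\ge(A\log N/n)^{1/2}$ and $\|h\|_{L_2(\Pi)}\le\|h\|_{\mathcal{H}_K}$. Taking $t=A\log N+\log 4$, we easily get that, for all $h\in\mathcal{H}_K$ such that $\|h\|_{\mathcal{H}_K}=1$ and $\|h\|_{L_2(\Pi)}\ge\exp\{-N^{A/2}\}$,

$$|(\Pi_n-\Pi)h^2|\le C\bigl(\breve\varepsilon\|h\|_{L_2(\Pi)}+\breve\varepsilon^2\bigr) \tag{25}$$

with probability at least $1-0.5N^{-A}$ and with a numerical constant $C>0$. In other words, with the same probability, for all $h\in\mathcal{H}_K$ such that $\|h\|_{L_2(\Pi)}/\|h\|_{\mathcal{H}_K}\ge\exp\{-N^{A/2}\}$,

$$|(\Pi_n-\Pi)h^2|\le C\bigl(\breve\varepsilon\|h\|_{L_2(\Pi)}\|h\|_{\mathcal{H}_K}+\breve\varepsilon^2\|h\|_{\mathcal{H}_K}^2\bigr). \tag{26}$$

Therefore, for all $h\in\mathcal{H}_K$ such that

$$\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{\mathcal{H}_K}}\ge\exp(-N^{A/2}) \tag{27}$$

we have

$$\|h\|_{L_2(\Pi)}^2=\Pi h^2\le\|h\|_{L_2(\Pi_n)}^2+C\bigl(\breve\varepsilon\|h\|_{L_2(\Pi)}\|h\|_{\mathcal{H}_K}+\breve\varepsilon^2\|h\|_{\mathcal{H}_K}^2\bigr),$$

$$\|h\|_{L_2(\Pi_n)}^2=\Pi_n h^2\le\|h\|_{L_2(\Pi)}^2+C\bigl(\breve\varepsilon\|h\|_{L_2(\Pi)}\|h\|_{\mathcal{H}_K}+\breve\varepsilon^2\|h\|_{\mathcal{H}_K}^2\bigr). \tag{28}$$

It can be now deduced that, for a proper value of numerical constant $C$,

$$\|h\|_{L_2(\Pi)}\le C\bigl(\|h\|_{L_2(\Pi_n)}+\breve\varepsilon\|h\|_{\mathcal{H}_K}\bigr)\quad\text{and}\quad\|h\|_{L_2(\Pi_n)}\le C\bigl(\|h\|_{L_2(\Pi)}+\breve\varepsilon\|h\|_{\mathcal{H}_K}\bigr).$$

It remains to consider the case when

$$\frac{\|h\|_{L_2(\Pi)}}{\|h\|_{\mathcal{H}_K}}<\exp(-N^{A/2}). \tag{29}$$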

Following a similar argument as before, with probability at least 1 — 0.5N~ A , 

sup \(U n -U)h 2 \ 
IWIw K =l 
ll^llL 2( n)<cxp(-^/ 2 ) 

< 16 f Ee*p(-N«*) + t + exp(-^/ 2 ) J^** + ^] . 
\ V n n I 

Under the conditions $A\ge 1$, $\log N\ge 2\log\log n$,

$$\breve\varepsilon\ge\Bigl(\frac{A\log N}{n}\Bigr)^{1/2}\ge\exp(-N^{A/2}). \tag{30}$$

Then

$$\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\exp(-N^{A/2})}}|(\Pi_n-\Pi)h^2|\le C\breve\varepsilon^2 \tag{31}$$

with probability at least $1-0.5N^{-A}$, which also implies (17) and (18), and the result follows. □

Theorem 4 shows that the two norms $\|h\|_{L_2(\Pi_n)}$ and $\|h\|_{L_2(\Pi)}$ are of the same order up to an error term $\breve\varepsilon\|h\|_{\mathcal{H}_K}$.
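
The union bound step (24) in the proof of Theorem 4 rests on the elementary fact that $\sum_j j^{-2}\le\pi^2/6<2$. The sketch below, with a hypothetical helper name, sums the individual failure probabilities $\exp(-t-2\log j)$ over the shells with $2\log j\le t$ and compares the total with $2\exp(-t)$; it also checks that the choice $t=A\log N+\log 4$ turns $2\exp(-t)$ into $0.5N^{-A}$.

```python
import math

def union_bound_failure(t):
    """Sum of exp(-t - 2 log j) over natural j with 2 log j <= t."""
    j_max = int(math.exp(t / 2.0))  # largest j with 2 log j <= t
    return sum(math.exp(-t - 2.0 * math.log(j)) for j in range(1, j_max + 1))

for t in (1.0, 5.0, 10.0):
    print(t, union_bound_failure(t), 2.0 * math.exp(-t))

# With t = A log N + log 4 the bound 2 exp(-t) equals 0.5 * N^(-A).
A, N = 2.0, 10
t = A * math.log(N) + math.log(4.0)
print(2.0 * math.exp(-t), 0.5 * N ** (-A))
```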

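3.2. Comparison of $\breve\varepsilon(K)$, $\bar\varepsilon(K)$, $\tilde\varepsilon(K)$ and $\hat\varepsilon(K)$. Recall the definitions

$$\gamma_n(\delta):=\Bigl(n^{-1}\sum_{k=1}^{\infty}(\lambda_k\wedge\delta^2)\Bigr)^{1/2},\qquad\delta\in(0,1],$$

where $\{\lambda_k\}$ are the eigenvalues of the integral operator $T_K$ from $L_2(\Pi)$ into $L_2(\Pi)$ with kernel $K$, and, for some $A\ge 1$,

$$\bar\varepsilon(K):=\inf\Bigl\{\varepsilon\ge\sqrt{\frac{A\log N}{n}}:\ \gamma_n(\delta)\le\varepsilon\delta+\varepsilon^2,\ \forall\delta\in(0,1]\Bigr\}.$$

It follows from Lemma 42 of Mendelson (2002) [with an additional application of Cauchy–Schwarz inequality for the upper bound and Hoffmann-Jørgensen inequality for the lower bound; see also Koltchinskii (2008)] that, for some numerical constants $C_1,C_2>0$,

$$C_1\Bigl(n^{-1}\sum_{k=1}^{\infty}(\lambda_k\wedge\delta^2)\Bigr)^{1/2}-n^{-1}\le\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\delta}}|R_n(h)|\le C_2\Bigl(n^{-1}\sum_{k=1}^{\infty}(\lambda_k\wedge\delta^2)\Bigr)^{1/2}. \tag{32}$$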

This fact and the definitions of $\breve\varepsilon(K)$, $\bar\varepsilon(K)$ easily imply the following result.

Proposition 5. Under the condition $K(x,x)\le 1$, $x\in S$, there exist numerical constants $C_1,C_2>0$ such that

$$C_1\bar\varepsilon(K)\le\breve\varepsilon(K)\le C_2\bar\varepsilon(K). \tag{33}$$

If $K$ is the kernel of the projection operator onto a finite-dimensional subspace $\mathcal{H}_K$ of $L_2(\Pi)$, it is easy to check that $\bar\varepsilon(K)\asymp\sqrt{\dim(\mathcal{H}_K)/n}$ (recall the notation $a\asymp b$, which means that there exists a numerical constant $c>0$ such that $c^{-1}\le a/b\le c$). If the eigenvalues $\lambda_k$ decay at a polynomial rate, that is, $\lambda_k\asymp k^{-2\beta}$ for some $\beta>1/2$, then $\bar\varepsilon(K)\asymp n^{-\beta/(2\beta+1)}$.
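
The two examples above can be explored numerically. The sketch below approximates $\bar\varepsilon(K)$ from a given list of eigenvalues by bisection on $\varepsilon$, checking the constraint $\gamma_n(\delta)\le\varepsilon\delta+\varepsilon^2$ only on a finite grid of $\delta\in(0,1]$. The function names, the grid size, the truncation of the spectrum at 2000 terms and the values of `A` and `N` are all assumptions made for illustration, not part of the paper.

```python
import math

def gamma_n(eigs, delta, n):
    """gamma_n(delta) = sqrt((1/n) * sum_k min(lambda_k, delta^2))."""
    return math.sqrt(sum(min(lam, delta * delta) for lam in eigs) / n)

def eps_bar(eigs, n, A=1.0, N=3, grid=200):
    """Approximate the smallest eps >= sqrt(A log N / n) such that
    gamma_n(delta) <= eps*delta + eps^2 on a grid of delta in (0, 1]."""
    floor = math.sqrt(A * math.log(N) / n)
    deltas = [(i + 1) / grid for i in range(grid)]
    gammas = [gamma_n(eigs, d, n) for d in deltas]

    def feasible(eps):
        return all(g <= eps * d + eps * eps for g, d in zip(gammas, deltas))

    if feasible(floor):
        return floor
    lo, hi = floor, max(2.0 * floor, 1.0)
    while not feasible(hi):          # feasibility is monotone in eps
        hi *= 2.0
    for _ in range(60):              # bisection between infeasible lo and feasible hi
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Projection kernel: d eigenvalues equal to 1, compare with sqrt(d/n).
d, n = 10, 1000
eps_proj = eps_bar([1.0] * d, n)
print(eps_proj, math.sqrt(d / n))

# Polynomial decay lambda_k = k^(-2*beta), compare with n^(-beta/(2*beta+1)).
beta = 1.0
eigs = [k ** (-2.0 * beta) for k in range(1, 2001)]
for n_obs in (10**4, 10**6):
    print(n_obs, eps_bar(eigs, n_obs), n_obs ** (-beta / (2.0 * beta + 1.0)))
```

Because the constraint is only checked on a grid and the spectrum is truncated, the output is an approximation; it is meant to show the order of magnitude, not the constants in the $\asymp$ relations.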
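Recall the notation

$$\hat\varepsilon(K):=\inf\Bigl\{\varepsilon\ge\sqrt{\frac{A\log N}{n}}:\ \Bigl(\frac{1}{n}\sum_{k=1}^{n}(\hat\lambda_k\wedge\delta^2)\Bigr)^{1/2}\le\varepsilon\delta+\varepsilon^2,\ \forall\delta\in(0,1]\Bigr\}, \tag{34}$$

where $\{\hat\lambda_k\}$ denote the eigenvalues of the Gram matrix $\hat K:=(K(X_i,X_j))_{i,j=1,\dots,n}$.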
It follows again from the results of Mendelson (2002) [namely, one can follow the proof of Lemma 42 in the case when the RKHS $\mathcal{H}_K$ is restricted to the sample $X_1,\dots,X_n$ and the expectations are conditional on the sample; then one uses Cauchy–Schwarz and Hoffmann-Jørgensen inequalities as in the proof of (32)] that for some numerical constants $C_1,C_2>0$

$$C_1\Bigl(n^{-1}\sum_{k=1}^{n}(\hat\lambda_k\wedge\delta^2)\Bigr)^{1/2}-n^{-1}\le\mathbb{E}_\varepsilon\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le\delta}}|R_n(h)|\le C_2\Bigl(n^{-1}\sum_{k=1}^{n}(\hat\lambda_k\wedge\delta^2)\Bigr)^{1/2}, \tag{35}$$

where $\mathbb{E}_\varepsilon$ indicates that the expectation is taken over the Rademacher random variables only (conditionally on $X_1,\dots,X_n$). Therefore, if we denote by

$$\tilde\varepsilon(K):=\inf\Bigl\{\varepsilon\ge\sqrt{\frac{A\log N}{n}}:\ \mathbb{E}_\varepsilon\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le\delta}}|R_n(h)|\le\varepsilon\delta+\varepsilon^2,\ \forall\delta\in(0,1]\Bigr\} \tag{36}$$

the empirical version of $\breve\varepsilon(K)$, then $\hat\varepsilon(K)\asymp\tilde\varepsilon(K)$. We will now show that $\tilde\varepsilon(K)\asymp\breve\varepsilon(K)$ with a high probability.

Theorem 6. Suppose that $A\ge 1$ and $\log N\ge 2\log\log n$. There exist numerical constants $C_1,C_2>0$ such that

$$C_1\bar\varepsilon(K)\le\hat\varepsilon(K)\le C_2\bar\varepsilon(K), \tag{37}$$

with probability at least $1-N^{-A}$.

Proof. Let $t:=A\log N+\log 14$. It follows from Talagrand concentration inequality that

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|\le 2\Bigl(\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|+2^{-j+1}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr)$$

with probability at least $1-\exp(-t-2\log j)$. On the other hand, as derived in the proof of Theorem 4 [see (23)]

$$\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|(\Pi_n-\Pi)h^2|\le 32\Bigl(\breve\varepsilon 2^{-j}+\breve\varepsilon^2+2^{-j}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr) \tag{38}$$

with probability at least $1-\exp(-t-2\log j)$. We will use these bounds only for $j$ such that $2\log j\le t$. In this case, the second bound implies that, for some numerical constant $c>0$ and all $h$ satisfying the conditions $\|h\|_{\mathcal{H}_K}=1$, $2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}$, we have $\|h\|_{L_2(\Pi_n)}\le c(2^{-j}+\breve\varepsilon)$ (again, see the proof of Theorem 4). Combining these bounds, we get that with probability at least $1-2\exp(-t-2\log j)$,

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|\le 2\Bigl(\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}}|R_n(h)|+2^{-j+1}\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr),$$

where $\delta_j=\breve\varepsilon+2^{-j}$.

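Applying now Talagrand concentration inequality to the Rademacher process conditionally on the observed data $X_1,\dots,X_n$ yields

$$\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}}|R_n(h)|\le 2\Bigl(\mathbb{E}_\varepsilon\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}}|R_n(h)|+c\delta_j\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr)$$

with conditional probability at least $1-\exp(-t-2\log j)$. From this and from the previous bound it is not hard to deduce that, for some numerical constants $C,C'$ and for all $j$ such that $2\log j\le t$,

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ 2^{-j}<\|h\|_{L_2(\Pi)}\le 2^{-j+1}}}|R_n(h)|\le C'\Bigl(\mathbb{E}_\varepsilon\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le c\delta_j}}|R_n(h)|+\delta_j\sqrt{\frac{t+2\log j}{n}}+\frac{t+2\log j}{n}\Bigr)$$

$$\le C(\tilde\varepsilon\delta_j+\tilde\varepsilon^2)\le C(\tilde\varepsilon 2^{-j}+\tilde\varepsilon\breve\varepsilon+\tilde\varepsilon^2)$$

with probability at least $1-3\exp(-t-2\log j)$. In obtaining the second inequality, we used the definition of $\tilde\varepsilon$ and the fact that, for $t=A\log N+\log 14$, $2\log j\le t$, $c_1\tilde\varepsilon\ge((t+2\log j)/n)^{1/2}$, where $c_1$ is a numerical constant.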
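Now, by the union bound, the above inequality holds with probability at least

$$1-3\sum_{j:\,2\log j\le t}\exp(-t-2\log j)\ge 1-6\exp(-t) \tag{39}$$

for all $j$ such that $2\log j\le t$ simultaneously. Similarly, it can be shown that

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\exp(-N^{A/2})}}|R_n(h)|\le C\bigl(\tilde\varepsilon\exp(-N^{A/2})+\tilde\varepsilon\breve\varepsilon+\tilde\varepsilon^2\bigr)$$

with probability at least $1-\exp(-t)$.

For $t=A\log N+\log 14$, we get

$$\mathbb{E}\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\delta}}|R_n(h)|\le C(\tilde\varepsilon\delta+\tilde\varepsilon\breve\varepsilon+\tilde\varepsilon^2), \tag{40}$$

for all $0<\delta\le 1$, with probability at least $1-7\exp(-t)=1-N^{-A}/2$. Now by the definition of $\breve\varepsilon$, we obtain

$$\breve\varepsilon\le C\max\{\tilde\varepsilon,(\tilde\varepsilon\breve\varepsilon+\tilde\varepsilon^2)^{1/2}\}, \tag{41}$$

which implies that $\breve\varepsilon\le C\tilde\varepsilon$ with probability at least $1-N^{-A}/2$.

Similarly one can show that

$$\mathbb{E}_\varepsilon\sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi_n)}\le\delta}}|R_n(h)|\le C(\breve\varepsilon\delta+\breve\varepsilon\tilde\varepsilon+\breve\varepsilon^2), \tag{42}$$

for all $0<\delta\le 1$, with probability at least $1-N^{-A}/2$, which implies that $\tilde\varepsilon\le C\breve\varepsilon$ with probability at least $1-N^{-A}/2$. The proof can then be completed by the union bound. □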

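Define

$$\check\varepsilon:=\check\varepsilon(K):=\inf\Bigl\{\varepsilon\ge\sqrt{\frac{A\log N}{n}}:\ \sup_{\substack{\|h\|_{\mathcal{H}_K}=1\\ \|h\|_{L_2(\Pi)}\le\delta}}|R_n(h)|\le\varepsilon\delta+\varepsilon^2,\ \forall\delta\in(0,1]\Bigr\}. \tag{43}$$

The next statement can be proved similarly to Theorem 6.

Theorem 7. There exist numerical constants $C_1,C_2>0$ such that

$$C_1\bar\varepsilon(K)\le\check\varepsilon(K)\le C_2\bar\varepsilon(K) \tag{44}$$

with probability at least $1-N^{-A}$.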

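Suppose now that $\{K_1,\dots,K_N\}$ is a dictionary of kernels. Recall that $\bar\varepsilon_j=\bar\varepsilon(K_j)$, $\hat\varepsilon_j=\hat\varepsilon(K_j)$ and $\check\varepsilon_j=\check\varepsilon(K_j)$.

It follows from Theorems 4, 6, 7 and the union bound that with probability at least $1-3N^{-A+1}$ for all $j=1,\dots,N$

$$\|h\|_{L_2(\Pi)}\le C\bigl(\|h\|_{L_2(\Pi_n)}+\bar\varepsilon_j\|h\|_{\mathcal{H}_j}\bigr),\qquad\|h\|_{L_2(\Pi_n)}\le C\bigl(\|h\|_{L_2(\Pi)}+\bar\varepsilon_j\|h\|_{\mathcal{H}_j}\bigr),\qquad h\in\mathcal{H}_j, \tag{45}$$

$$C_1\bar\varepsilon_j\le\hat\varepsilon_j\le C_2\bar\varepsilon_j\quad\text{and}\quad C_1\check\varepsilon_j\le\bar\varepsilon_j\le C_2\check\varepsilon_j. \tag{46}$$

Note also that

$$3N^{-A+1}=\exp\{-(A-1)\log N+\log 3\}\le\exp\{-(A/2)\log N\}=N^{-A/2},$$

provided that $A\ge 4$ and $N\ge 3$. Thus, under these additional constraints, (45) and (46) hold for all $j=1,\dots,N$ with probability at least $1-N^{-A/2}$.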
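
The arithmetic behind the last display is equivalent to $3\le N^{A/2-1}$, which is where the constraints $A\ge 4$ and $N\ge 3$ come from. A minimal check on a log scale (the helper name and the small tolerance are illustrative choices):

```python
import math

def union_bound_ok(A, N, tol=1e-12):
    """Check 3 * N^(-A+1) <= N^(-A/2) by comparing logarithms."""
    lhs = math.log(3.0) - (A - 1.0) * math.log(N)
    rhs = -(A / 2.0) * math.log(N)
    return lhs <= rhs + tol

print(union_bound_ok(4, 3))    # boundary case: equality
print(union_bound_ok(6, 10))
print(union_bound_ok(2, 3))    # A too small: the inequality fails
```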

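4. Proofs of the oracle inequalities. For an arbitrary set $J\subset\{1,\dots,N\}$ and $b\in(0,+\infty)$, denote

$$\mathcal{K}_J^{(b)}:=\Bigl\{(f_1,\dots,f_N)\in\mathcal{H}^{(N)}:\ \sum_{j\notin J}\bar\varepsilon_j\|f_j\|_{L_2(\Pi)}\le b\sum_{j\in J}\bar\varepsilon_j\|f_j\|_{L_2(\Pi)}\Bigr\} \tag{47}$$

and let

$$\beta_b(J)=\inf\Bigl\{\beta>0:\ \sum_{j\in J}\bar\varepsilon_j\|f_j\|_{L_2(\Pi)}\le\beta\|f_1+\cdots+f_N\|_{L_2(\Pi)},\ (f_1,\dots,f_N)\in\mathcal{K}_J^{(b)}\Bigr\}. \tag{48}$$

It is easy to see that, for all nonempty sets $J$, $\beta_b(J)\ge\max_{j\in J}\bar\varepsilon_j\ge\sqrt{A\log N/n}$.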

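Theorems 2 and 3 will be easily deduced from the following technical result.

Theorem 8. There exist numerical constants $C_1,C_2,B>0$ and $b>0$ such that, for all $\tau\ge BL_*$ in the definition of $\varepsilon_j=\tau\hat\varepsilon_j$, $j=1,\dots,N$, and for all oracles $(f_1,\dots,f_N)\in D$,

$$\mathcal{E}(\ell\circ\hat f)+C_1\Bigl(\tau\sum_{j=1}^N\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\sum_{j=1}^N\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr) \tag{49}$$

$$\le 2\mathcal{E}(\ell\circ f)+C_2\tau^2\Bigl(\sum_{j\in J_f}\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}+\frac{\beta_b^2(J_f)}{m_*}\Bigr) \tag{50}$$

with probability at least $1-3N^{-A/2}$. Here, $A\ge 4$ is a constant involved in the definitions of $\bar\varepsilon_j$, $\hat\varepsilon_j$, $j=1,\dots,N$.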

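Proof. Recall that

$$(\hat f_1,\dots,\hat f_N):=\operatorname*{argmin}_{(f_1,\dots,f_N)\in D}\Bigl[P_n(\ell\circ(f_1+\cdots+f_N))+\sum_{j=1}^N\bigl(\tau\hat\varepsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)\Bigr]$$

and that we write $f:=f_1+\cdots+f_N$, $\hat f:=\hat f_1+\cdots+\hat f_N$. Hence, for all $(f_1,\dots,f_N)\in D$,

$$P_n(\ell\circ\hat f)+\sum_{j=1}^N\bigl(\tau\hat\varepsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\bigr)\le P_n(\ell\circ f)+\sum_{j=1}^N\bigl(\tau\hat\varepsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr).$$

By a simple algebra,

$$\mathcal{E}(\ell\circ\hat f)+\sum_{j=1}^N\bigl(\tau\hat\varepsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\bigr)\le\mathcal{E}(\ell\circ f)+\sum_{j\in J_f}\bigl(\tau\hat\varepsilon_j\|f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+|(P_n-P)(\ell\circ\hat f-\ell\circ f)|$$

and, by the triangle inequality,

$$\mathcal{E}(\ell\circ\hat f)+\sum_{j\notin J_f}\tau\hat\varepsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\sum_{j=1}^N\tau^2\hat\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\le\mathcal{E}(\ell\circ f)+\sum_{j\in J_f}\bigl(\tau\hat\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+|(P_n-P)(\ell\circ\hat f-\ell\circ f)|.$$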

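We now take advantage of (45) and (46) to replace $\hat\varepsilon_j$'s by $\bar\varepsilon_j$'s and $\|\cdot\|_{L_2(\Pi_n)}$ by $\|\cdot\|_{L_2(\Pi)}$. Specifically, there exists a numerical constant $C>1$ and an event $E$ of probability at least $1-N^{-A/2}$ such that

$$\frac{1}{C}\le\min\Bigl\{\frac{\hat\varepsilon_j}{\bar\varepsilon_j}:j=1,\dots,N\Bigr\}\le\max\Bigl\{\frac{\hat\varepsilon_j}{\bar\varepsilon_j}:j=1,\dots,N\Bigr\}\le C \tag{51}$$

and, for all $j=1,\dots,N$,

$$\frac{1}{C}\|\hat f_j\|_{L_2(\Pi)}-\bar\varepsilon_j\|\hat f_j\|_{\mathcal{H}_j}\le\|\hat f_j\|_{L_2(\Pi_n)}\le C\bigl(\|\hat f_j\|_{L_2(\Pi)}+\bar\varepsilon_j\|\hat f_j\|_{\mathcal{H}_j}\bigr). \tag{52}$$

Taking $\tau\ge C/(C-1)$, we have that, on the event $E$,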

N 

£(iof) + E ^M\\L 2i n n) + E T2e 1ll^lk- 
Vj^J/ 3=1 J 

Wj/ V ' 3=1 / 

\3iJf j=i / 

Similarly, 

/) + E " Allien.) + r 2 i 2 M\\n 3 ) 

jeJf 

< £(£of) + C 2 E K'll/i " /ill^cn-) + ^H/ilk) 

< £(*o /) + C 3 E "idl/j - /ilU 2( n) + eiH/i - /ilk) 

+c 2 E^n/,-ik- 

<£(£of) + C 3 E ^-(ll/i - fAL^ + ^Mll^ + eMlln,) 

+^ 2 E^ii/,-ik 

< o /) + 2C 3 E (reill/j " /ilU 2( n) + r 2 e)\\f 3 \\ H] ) 

Therefore, by taking r large enough, namely r > V (2C 6 ), we can find 
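numerical constants $0<C_1<1<C_2$ such that, on the event $E$,

$$\mathcal{E}(\ell\circ\hat f)+C_1\Bigl(\sum_{j\notin J_f}\tau\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le\mathcal{E}(\ell\circ f)+C_2\sum_{j\in J_f}\bigl(\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+|(P_n-P)(\ell\circ\hat f-\ell\circ f)|.$$

We now bound the empirical process $|(P_n-P)(\ell\circ\hat f-\ell\circ f)|$, where we use the following result that will be proved in the next section. Suppose that $f=\sum_{j=1}^N f_j$, $f_j\in\mathcal{H}_j$ and $\|f\|_{L_\infty}\le R$ (we will need it with $R=R_D^*$). Denote

$$\mathcal{G}(\Delta_-,\Delta_+,R):=\Bigl\{g=\sum_{j=1}^N g_j:\ g_j\in\mathcal{H}_j,\ \|g\|_{L_\infty}\le R,\ \sum_{j=1}^N\bar\varepsilon_j\|g_j-f_j\|_{L_2(\Pi)}\le\Delta_-,\ \sum_{j=1}^N\bar\varepsilon_j^2\|g_j-f_j\|_{\mathcal{H}_j}\le\Delta_+\Bigr\}.$$

Lemma 9. There exists a numerical constant $C>0$ such that for an arbitrary $A\ge 1$ involved in the definition of $\bar\varepsilon_j$, $j=1,\dots,N$, with probability at least $1-2N^{-A/2}$, for all

$$\Delta_-\le e^N,\qquad\Delta_+\le e^N \tag{53}$$

the following bound holds:

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R_D^*)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*(\Delta_-+\Delta_++e^{-N}). \tag{54}$$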



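Assuming that

$$\sum_{j=1}^N\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}\le e^N,\qquad\sum_{j=1}^N\bar\varepsilon_j^2\|\hat f_j-f_j\|_{\mathcal{H}_j}\le e^N \tag{55}$$

and using the lemma, we get

$$\mathcal{E}(\ell\circ\hat f)+C_1\Bigl(\sum_{j\notin J_f}\tau\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)$$

$$\le\mathcal{E}(\ell\circ f)+C_2\sum_{j\in J_f}\bigl(\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+C_3L_*\sum_{j=1}^N\bigl(\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\bar\varepsilon_j^2\|\hat f_j-f_j\|_{\mathcal{H}_j}\bigr)+C_3L_*e^{-N}$$

$$\le\mathcal{E}(\ell\circ f)+C_2\sum_{j\in J_f}\bigl(\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+C_3L_*\sum_{j=1}^N\bigl(\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}+\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+C_3L_*e^{-N}$$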
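for some numerical constant $C_3>0$. By choosing a numerical constant $B$ properly, $\tau$ can be made large enough so that $2C_3L_*\le\tau C_1\le\tau C_2$. Then, we have

$$\mathcal{E}(\ell\circ\hat f)+\frac{1}{2}C_1\Bigl(\sum_{j\notin J_f}\tau\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le\mathcal{E}(\ell\circ f)+2C_2\sum_{j\in J_f}\bigl(\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}\bigr)+(C_2/2)\tau e^{-N}, \tag{56}$$

which also implies

$$\mathcal{E}(\ell\circ\hat f)+\frac{1}{2}C_1\Bigl(\sum_{j=1}^N\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le\mathcal{E}(\ell\circ f)+\Bigl(2C_2+\frac{C_1}{2}\Bigr)\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+2C_2\tau^2\sum_{j\in J_f}\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}+(C_2/2)\tau e^{-N}. \tag{57}$$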

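We first consider the case when

$$4C_2\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}\ge\mathcal{E}(\ell\circ f)+2C_2\sum_{j\in J_f}\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}+(C_2/2)\tau e^{-N}. \tag{58}$$

Then (56) implies that

$$\mathcal{E}(\ell\circ\hat f)+\frac{1}{2}C_1\Bigl(\sum_{j\notin J_f}\tau\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le 6C_2\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}, \tag{59}$$

which yields

$$\sum_{j\notin J_f}\tau\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}\le\frac{12C_2}{C_1}\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}. \tag{60}$$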

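Therefore, $(\hat f_1-f_1,\dots,\hat f_N-f_N)\in\mathcal{K}_{J_f}^{(b)}$ with $b:=12C_2/C_1$. Using the definition of $\beta_b(J_f)$, it follows from (57), (58) and the assumption $C_1<1<C_2$ that

$$\mathcal{E}(\ell\circ\hat f)+\frac{1}{2}C_1\Bigl(\sum_{j=1}^N\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le\Bigl(6C_2+\frac{C_1}{2}\Bigr)\tau\beta_b(J_f)\|\hat f-f\|_{L_2(\Pi)}$$

$$\le 7C_2\tau\beta_b(J_f)\bigl(\|\hat f-f_*\|_{L_2(\Pi)}+\|f_*-f\|_{L_2(\Pi)}\bigr).$$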
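Recall that for losses of quadratic type

$$\mathcal{E}(\ell\circ\hat f)\ge m_*\|\hat f-f_*\|_{L_2(\Pi)}^2\quad\text{and}\quad\mathcal{E}(\ell\circ f)\ge m_*\|f-f_*\|_{L_2(\Pi)}^2. \tag{61}$$

Then

$$\mathcal{E}(\ell\circ\hat f)+\frac{1}{2}C_1\Bigl(\sum_{j=1}^N\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\Bigr)\le 7\tau C_2m_*^{-1/2}\beta_b(J_f)\bigl(\mathcal{E}^{1/2}(\ell\circ\hat f)+\mathcal{E}^{1/2}(\ell\circ f)\bigr).$$

Using the fact that $ab\le(a^2+b^2)/2$, we get

$$7\tau C_2m_*^{-1/2}\beta_b(J_f)\,\mathcal{E}^{1/2}(\ell\circ\hat f)\le(49/2)\tau^2C_2^2m_*^{-1}\beta_b^2(J_f)+\frac{1}{2}\mathcal{E}(\ell\circ\hat f) \tag{62}$$

and

$$7\tau C_2m_*^{-1/2}\beta_b(J_f)\,\mathcal{E}^{1/2}(\ell\circ f)\le(49/2)\tau^2C_2^2m_*^{-1}\beta_b^2(J_f)+\frac{1}{2}\mathcal{E}(\ell\circ f). \tag{63}$$

Therefore,

$$\mathcal{E}(\ell\circ\hat f)+C_1\sum_{j=1}^N\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}+C_1\sum_{j=1}^N\tau^2\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\le\mathcal{E}(\ell\circ f)+100\tau^2C_2^2m_*^{-1}\beta_b^2(J_f). \tag{64}$$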
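We now consider the case when

$$4C_2\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}<\mathcal{E}(\ell\circ f)+2C_2\sum_{j\in J_f}\tau^2\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}+(C_2/2)\tau e^{-N}. \tag{65}$$

It is easy to derive from (57) that in this case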

(66) 



/TV TV \ 

*(*° /) + 2 Cl E«iH/i - /ilk(n) +E^IH/ilk 
\j=i j=i / 

Since $\beta_b(J_f)\ge\sqrt{A\log N/n}$ [see the comment after the definition of $\beta_b(J_f)$], we have

$$\tau e^{-N}\le\tau^2\frac{A\log N}{n}\le\tau^2\beta_b^2(J_f),$$

where we also used the assumptions that $\log N\ge 2\log\log n$ and $A\ge 4$. Substituting this in (66) and then combining the resulting bound with (64) concludes the proof of (49) in the case when conditions (55) hold.
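
The step $e^{-N}\le A\log N/n$ used above depends on the constraint $\log N\ge 2\log\log n$, which forces $N\ge(\log n)^2$. The following sketch checks it numerically for the smallest admissible $N$ at each sample size; the helper names, the lower limit $N\ge 2$ and the tested range of $n$ are assumptions made for illustration.

```python
import math

def smallest_admissible_N(n):
    """Smallest integer N >= 2 with log N >= 2 log log n (n >= 3 assumed)."""
    target = 2.0 * math.log(math.log(n))
    N = 2
    while math.log(N) < target:
        N += 1
    return N

def exp_term_is_dominated(n, A=4.0):
    """Check e^{-N} <= A log N / n at the smallest admissible N."""
    N = smallest_admissible_N(n)
    return math.exp(-N) <= A * math.log(N) / n

print(all(exp_term_is_dominated(n) for n in range(3, 2000)))
```

Larger values of $N$ only make $e^{-N}$ smaller and $\log N$ larger, so checking the smallest admissible $N$ covers the worst case for each $n$.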

It remains to consider the case when (55) does not hold. The main idea is to show that in this case the right-hand side of the oracle inequality is rather large while we still can control the left-hand side, so the inequality becomes trivial. To this end, note that, by the definition of $\hat f$, for some numerical constant $c_1$,

$$\sum_{j=1}^N\bigl(\tau\hat\varepsilon_j\|\hat f_j\|_{L_2(\Pi_n)}+\tau^2\hat\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\bigr)\le n^{-1}\sum_{j=1}^n\ell(Y_j;0)\le c_1$$

[since the value of the penalized empirical risk at $\hat f$ is not larger than its value at $f=0$ and, by the assumptions on the loss, $\ell(y,0)$ is uniformly bounded by a numerical constant]. The last equation implies that, on the event $E$ defined earlier in the proof [see (51), (52)], the following bound holds:

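$$\sum_{j=1}^N\frac{\tau}{C}\bar\varepsilon_j\Bigl(\frac{1}{C}\|\hat f_j\|_{L_2(\Pi)}-\bar\varepsilon_j\|\hat f_j\|_{\mathcal{H}_j}\Bigr)+\sum_{j=1}^N\frac{\tau^2}{C^2}\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\le c_1.$$

Equivalently,

$$\frac{\tau}{C^2}\sum_{j=1}^N\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\Bigl(\frac{\tau^2}{C^2}-\frac{\tau}{C}\Bigr)\sum_{j=1}^N\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\le c_1.$$

As soon as $\tau\ge 2C$, so that $\tau^2/C^2-\tau/C\ge\tau^2/(2C^2)$, we have

$$\tau\sum_{j=1}^N\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)}+\tau^2\sum_{j=1}^N\bar\varepsilon_j^2\|\hat f_j\|_{\mathcal{H}_j}\le 2c_1C^2. \tag{67}$$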
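
The threshold $\tau\ge 2C$ comes from the identity $\tau^2/C^2-\tau/C-\tau^2/(2C^2)=(\tau/C)(\tau/(2C)-1)$, which is nonnegative exactly when $\tau\ge 2C$. A small numerical illustration (the function name and the sample values are arbitrary):

```python
def coefficient_slack(tau, C):
    """tau^2/C^2 - tau/C - tau^2/(2 C^2); nonnegative iff tau >= 2C."""
    return tau ** 2 / C ** 2 - tau / C - tau ** 2 / (2.0 * C ** 2)

C = 2.0
for tau in (1.5 * C, 2.0 * C, 3.0 * C):
    print(tau, coefficient_slack(tau, C))
```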

Note also that, by the assumptions on the loss function,

$$\mathcal{E}(\ell\circ\hat f)\le P(\ell\circ\hat f)\le\mathbb{E}\ell(Y;0)+|P(\ell\circ\hat f)-P(\ell\circ 0)|\le c_1+L_*\|\hat f\|_{L_2(\Pi)}\le c_1+L_*\sum_{j=1}^N\|\hat f_j\|_{L_2(\Pi)}\le c_1+2c_1C^2L_*\frac{1}{\tau}\sqrt{\frac{n}{A\log N}}, \tag{68}$$

where we used the Lipschitz condition on $\ell$, and also bound (67) and the fact that $\bar\varepsilon_j\ge\sqrt{A\log N/n}$ (by its definition).

Recall that we are considering the case when (55) does not hold. We will consider two cases: (a) when $e^N\le C_3$, where $C_3\ge c_1$ is a numerical constant, and (b) when $e^N>C_3$. The first case is very simple since $N$ and $n$ are both upper bounded by a numerical constant (recall the assumption $\log N\ge 2\log\log n$). In this case, $\beta_b(J_f)\ge\sqrt{A\log N/n}$ is bounded from below by a numerical constant. As a consequence of these observations, bounds (67) and (68) imply that

/ n N \ 

*('° /) + Cx [Y,^M\\l 2 (u) + E^HAlk 

Vi=i i=i / 

for some numerical constant C2 > 0. In the case (b), we have 

and, in view of (67), this implies 

N N 

E SiH/iltacn) + E ^H/ilk ^ ^ " c i/ 2 ^ c *7 2 - 
i=i i=i 

So, either we have 

TV 

E^U/ilk^/ 4 

i=i 

or 

TV 

E^'ll/ill^(n)>^/4- 

Moreover, in the second case, we also have 

n I A\o N N 

3=1 i =1 



N , jAlogN 



>(e"/4) 

V n 

In both cases we can conclude that, under the assumption that log A?" > 
2 log log n and e N > C3 for a sufficiently large numerical constant C3 , 

N 

^o/) + E( r ^ll^'ll^(n)+^ll/ilk) 
j'=i 
< ci + 2ciC 2 L*-J- , n Ar + 2ciC 2 



r V AIorN 



^TV V AlQgN 2 V- -2||, || 



J€J> 



Thus, in both cases (a) and (b), the following bound holds: 



N 



N 



(69) 



3=1 



^ C 2r 2 (^6]||/ J ||^+/3 b 2 (J / )). 

To complete the proof, observe that 

/ N N \ 

Vi=i j=i / 

(N N 
3=1 3=1 



(70) 



+ Ci 2 r£ jH^ ~ ^'Hia(n) 
3'eJ/ 



+ C 2 ^ re^ll/j - /j|| i2 (n)- 
3'eJ/ 

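Note also that, by the definition of $\beta_b(J_f)$, for all $b>0$,

$$\sum_{j\in J_f}\tau\bar\varepsilon_j\|\hat f_j-f_j\|_{L_2(\Pi)}\le\tau\beta_b(J_f)\Bigl\|\sum_{j\in J_f}(\hat f_j-f_j)\Bigr\|_{L_2(\Pi)}$$

$$\le\tau\beta_b(J_f)\|\hat f-f\|_{L_2(\Pi)}+\tau\beta_b(J_f)\sqrt{\frac{n}{A\log N}}\sum_{j\notin J_f}\bar\varepsilon_j\|\hat f_j\|_{L_2(\Pi)} \tag{71}$$

$$\le\tau\beta_b(J_f)\|\hat f-f\|_{L_2(\Pi)}+\tau\beta_b(J_f)\frac{2c_1C^2}{\tau}\sqrt{\frac{n}{A\log N}},$$

where we used the fact that, for all $j$, $\bar\varepsilon_j\ge\sqrt{A\log N/n}$ and also bound (67). By an argument similar to (61)–(64), it is easy to deduce from the last bound that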

ft £ Tr M - /ill«n) < Y^-ftVi) + \ev°f) + \e{i°f) 

(72) 



2c? C 4 n 



r 2 AlogN' 
Substituting this in bound (70), we get 



/ N N \ 

-£(£of) + Cx - /;|U 2 (n) + E r ^H/ilk 

\j=l j=l J 



3C 2 ¥ o2/M \ cid t . 2c 2 C A 



(73) +o^-^( J /)+o^°/) + 



n 



2 m* owy 2 v " r 2 ,41og7V 



+ 



2c?C 2 n 



T 2 AlogN' 

with some numerical constant $C_2$. It is enough now to observe [considering again the cases (a) and (b), as it was done before], that either the last term is upper bounded by $\sum_{j\in J_f}\bar\varepsilon_j^2\|f_j\|_{\mathcal{H}_j}$, or it is upper bounded by $\beta_b^2(J_f)$, to complete the proof. □

Now, to derive Theorem 2, it is enough to check that, for a numerical 
constant c> 0, 

1/2 



\ 1/2 

E £ 1J A,oo(J», 



be J f 

1/2 

which easily follows from the definitions of fib and $2,00 ■ Similarly, the proof 
of Theorem 3 follows from the fact that, under the assumption that A -1 < 

% < A, we have fZj C , where 6' = cA 2 6, c being a numerical constant. 

This easily implies the bound fib (Jf) < ciKfi 2 ^ (Jf)\/ d(f)e, where c\ is a 
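numerical constant.

5. Bounding the empirical process. We now proceed to prove Lemma 9 that was used to bound $|(P_n-P)(\ell\circ\hat f-\ell\circ f)|$. To this end, we begin with a fixed pair $(\Delta_-,\Delta_+)$. Throughout the proof, we write $R:=R_D^*$. By Talagrand's concentration inequality, with probability at least $1-e^{-t}$

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le 2\Bigl(\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|+\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}\|\ell\circ g-\ell\circ f\|_{L_2(P)}\sqrt{\frac{t}{n}}+\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}\|\ell\circ g-\ell\circ f\|_{L_\infty}\frac{t}{n}\Bigr).$$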

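Now note that

$$\|\ell\circ g-\ell\circ f\|_{L_2(P)}\le L_*\|g-f\|_{L_2(\Pi)}\le L_*\sum_{j=1}^N\|g_j-f_j\|_{L_2(\Pi)}\le L_*\Bigl(\min_j\bar\varepsilon_j\Bigr)^{-1}\sum_{j=1}^N\bar\varepsilon_j\|g_j-f_j\|_{L_2(\Pi)},$$

where we used the fact that the Lipschitz constant of the loss $\ell$ on the range of functions from $\mathcal{G}(\Delta_-,\Delta_+,R)$ is bounded by $L_*$. Together with the fact that $\bar\varepsilon_j\ge(A\log N/n)^{1/2}$ for all $j$, this yields

$$\|\ell\circ g-\ell\circ f\|_{L_2(P)}\le L_*\sqrt{\frac{n}{A\log N}}\,\Delta_-. \tag{74}$$

Furthermore,

$$\|\ell\circ g-\ell\circ f\|_{L_\infty}\le L_*\|g-f\|_{L_\infty}\le L_*\sum_{j=1}^N\|g_j-f_j\|_{\mathcal{H}_j}\le L_*\frac{n}{A\log N}\,\Delta_+.$$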



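In summary, we have

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le 2\Bigl(\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|+L_*\Delta_-\sqrt{\frac{t}{A\log N}}+L_*\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr).$$

Now, by symmetrization inequality

$$\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le 2\,\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|R_n(\ell\circ g-\ell\circ f)|. \tag{75}$$

An application of Rademacher contraction inequality further yields

$$\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*\,\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|R_n(g-f)|, \tag{76}$$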

where $C>0$ is a numerical constant [again, it was used here that the Lipschitz constant of the loss $\ell$ on the range of functions from $\mathcal{G}(\Delta_-,\Delta_+,R)$ is bounded by $L_*$]. Applying Talagrand's concentration inequality another time, we get that with probability at least $1-e^{-t}$

$$\mathbb{E}\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|R_n(g-f)|\le C\Bigl(\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|R_n(g-f)|+\Delta_-\sqrt{\frac{t}{A\log N}}+\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr)$$

for some numerical constant $C>0$.

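Recalling the definition of $\check\varepsilon_j:=\check\varepsilon(K_j)$, we get

$$|R_n(h_j)|\le\check\varepsilon_j\|h_j\|_{L_2(\Pi)}+\check\varepsilon_j^2\|h_j\|_{\mathcal{H}_j},\qquad h_j\in\mathcal{H}_j. \tag{77}$$

Hence, with probability at least $1-2e^{-t}$ and with some numerical constant $C>0$,

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*\Bigl(\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|R_n(g-f)|+\Delta_-\sqrt{\frac{t}{A\log N}}+\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr)$$

$$\le CL_*\Bigl(\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}\sum_{j=1}^N|R_n(g_j-f_j)|+\Delta_-\sqrt{\frac{t}{A\log N}}+\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr)$$

$$\le CL_*\Bigl(\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}\sum_{j=1}^N\bigl(\check\varepsilon_j\|g_j-f_j\|_{L_2(\Pi)}+\check\varepsilon_j^2\|g_j-f_j\|_{\mathcal{H}_j}\bigr)+\Delta_-\sqrt{\frac{t}{A\log N}}+\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr).$$

Using (46), $\check\varepsilon_j$ can be upper bounded by $c\bar\varepsilon_j$ with some numerical constant $c>0$ on an event $E$ of probability at least $1-N^{-A/2}$. Therefore, the following bound is obtained:

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*\Bigl(\Delta_-+\Delta_++\Delta_-\sqrt{\frac{t}{A\log N}}+\Delta_+\frac{t}{n}\,\frac{n}{A\log N}\Bigr).$$

It holds on the event $E\cap F(\Delta_-,\Delta_+,t)$, where $\mathbb{P}(F(\Delta_-,\Delta_+,t))\ge 1-2e^{-t}$.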

We will now choose $t=A\log N+4\log N+4\log(2/\log 2)$ and obtain a bound that holds uniformly over

$$e^{-N}\le\Delta_-\le e^N\quad\text{and}\quad e^{-N}\le\Delta_+\le e^N. \tag{78}$$

To this end, consider

$$\Delta_j^-=\Delta_j^+:=2^{-j}. \tag{79}$$

For any $\Delta_j^-$ and $\Delta_k^+$ satisfying (78), we have

$$\sup_{g\in\mathcal{G}(\Delta_j^-,\Delta_k^+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*\Bigl(\Delta_j^-+\Delta_k^++\Delta_j^-\sqrt{\frac{t}{A\log N}}+\Delta_k^+\frac{t}{n}\,\frac{n}{A\log N}\Bigr)$$

on the event $E\cap F(\Delta_j^-,\Delta_k^+,t)$. Therefore, simultaneously for all $\Delta_j^-$ and $\Delta_k^+$ satisfying (78), we have

$$\sup_{g\in\mathcal{G}(\Delta_j^-,\Delta_k^+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*\Bigl(\Delta_j^-+\Delta_k^++\Delta_j^-\sqrt{\frac{A\log N+4\log N+4\log(2/\log 2)}{A\log N}}+\Delta_k^+\frac{A\log N+4\log N+4\log(2/\log 2)}{n}\,\frac{n}{A\log N}\Bigr)$$

on the event $E':=E\cap\bigl(\bigcap_{j,k}F(\Delta_j^-,\Delta_k^+,t)\bigr)$. The last intersection is over all $j,k$ such that conditions (78) hold for $\Delta_j^-,\Delta_k^+$. The number of the events in this intersection is bounded by $(2/\log 2)^2N^2$. Therefore,

$$\mathbb{P}(E')\ge 1-(2/\log 2)^2N^2\exp(-A\log N-4\log N-4\log(2/\log 2))-\mathbb{P}(E^c)\ge 1-2N^{-A/2}. \tag{80}$$

Using monotonicity of the functions of $\Delta_-,\Delta_+$ involved in the inequalities, the bounds can be extended to the whole range of values of $\Delta_-,\Delta_+$ satisfying (78), so, with probability at least $1-2N^{-A/2}$ we have for all such $\Delta_-,\Delta_+$

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*(\Delta_-+\Delta_+). \tag{81}$$

If $\Delta_-\le e^{-N}$, or $\Delta_+\le e^{-N}$, it follows by monotonicity of the left-hand side that with the same probability

$$\sup_{g\in\mathcal{G}(\Delta_-,\Delta_+,R)}|(P_n-P)(\ell\circ g-\ell\circ f)|\le CL_*(\Delta_-+\Delta_++e^{-N}), \tag{82}$$

which completes the proof.
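
The probability bookkeeping in (80) can be checked numerically. The sketch below takes the stated bound $(2/\log 2)^2N^2$ on the number of grid events at face value, charges each event a failure probability of $2e^{-t}$ (as in the bound on $F(\Delta_-,\Delta_+,t)$), adds the failure probability $N^{-A/2}$ of $E$, and compares the total with $2N^{-A/2}$. The function name and the tested ranges are illustrative assumptions.

```python
import math

def lemma9_failure_bound(A, N):
    """Upper bound on P(E'^c) assembled from the quantities in the proof."""
    t = A * math.log(N) + 4.0 * math.log(N) + 4.0 * math.log(2.0 / math.log(2.0))
    n_events = (2.0 / math.log(2.0)) ** 2 * N ** 2   # stated bound on the grid size
    grid_failure = n_events * 2.0 * math.exp(-t)     # each F-event fails w.p. <= 2 e^{-t}
    return grid_failure + N ** (-A / 2.0)            # plus the failure of event E

for A, N in ((1.0, 2), (4.0, 3), (4.0, 100)):
    print(A, N, lemma9_failure_bound(A, N), 2.0 * N ** (-A / 2.0))
```

The grid contribution is of order $N^{-A-2}$, so it is dominated by $N^{-A/2}$ with a wide margin; the extra $4\log N$ in $t$ is what pays for the $N^2$ grid points.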